RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Chuang Zhu; Donghong Jiang; Endian Lin; Hanqing Liu; Luoping Cui; Mingjie Liu

arxiv: 2605.19329 · v2 · pith:TBWQ7MEGnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 10:13 UTC · model grok-4.3

keywords event-augmented vision-language model · dual-stream architecture · scene understanding · event cameras · RGB-Event datasets · graph-driven pipeline · visual question answering · synthetic data generation

The pith

RE-VLM pairs RGB images with event camera streams in a dual-stream model to improve scene understanding when lighting or motion makes standard frames unreliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RE-VLM as a vision-language model that runs separate encoders on RGB images and event streams, then aligns the combined features with language through progressive training. Conventional models lose accuracy in low light, high dynamic range, or fast motion because RGB data degrades; event cameras record brightness changes at high speed and wide range, supplying the missing cues. To overcome the lack of suitable training data, the authors introduce a graph-driven pipeline that turns synchronized RGB-Event input into scene graphs and then into synthetic captions and question-answer pairs. Two new datasets, PEOD-Chat and RGBE-Chat, are built to cover illumination-challenged and general scenes. Benchmarks show the model outperforms both RGB-only and event-only baselines of similar size, with the largest gains appearing precisely where RGB data is weakest.

Core claim

RE-VLM is the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. It employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, a graph-driven pipeline converts synchronized RGB-Event streams into verifiable scene graphs, from which captions and QA pairs are synthesized. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions.

What carries the argument

Dual-stream architecture of parallel RGB and event encoders whose outputs are progressively aligned to language, supported by a graph-driven pipeline that turns synchronized streams into scene graphs and synthetic supervision.
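
As a rough illustration of the dual-stream idea, the sketch below projects RGB tokens and event tokens to a shared width and concatenates them into one visual prefix for a language model. Everything in it is hypothetical: the function names, the toy projection matrices, and the concatenation-based fusion are assumptions made for illustration, since the text above does not say how RE-VLM combines the two streams.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # rows = output dims, cols = input dims


def project(tokens: List[Vector], weight: Matrix) -> List[Vector]:
    """Apply a bias-free linear map to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def fuse_streams(rgb_tokens: List[Vector], event_tokens: List[Vector],
                 w_rgb: Matrix, w_event: Matrix) -> List[Vector]:
    """Project both streams to a shared width, then concatenate along the sequence axis.

    Concatenation is an assumed fusion rule, not the paper's documented one.
    """
    return project(rgb_tokens, w_rgb) + project(event_tokens, w_event)


# Toy inputs: two RGB tokens of width 3, one event token of width 2, shared width 2.
rgb_tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
event_tokens = [[3.0, 4.0]]
w_rgb = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # keeps input dims 0 and 2
w_event = [[0.5, 0.0], [0.0, 0.5]]          # halves each input dim

visual_prefix = fuse_streams(rgb_tokens, event_tokens, w_rgb, w_event)
print(visual_prefix)  # [[1.0, 2.0], [0.0, 0.0], [1.5, 2.0]]
```

Keeping a separate projection per modality is one simple way to let encoders with different output widths feed the same language model; a trained system would learn these matrices rather than fix them by hand.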

If this is right

Scene understanding tasks such as captioning and visual question answering become more reliable when RGB data is degraded by low light or rapid motion.
Training can proceed without large-scale manual annotation once synchronized RGB-Event streams are available to drive the graph pipeline.
Progressive alignment allows heterogeneous visual streams to be fused without forcing one modality to dominate the other.
Performance gains are largest exactly in the conditions where conventional RGB-only models lose accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-driven synthesis step could be reused to generate training data for other multimodal tasks that combine asynchronous sensors with text.
Extending the dual-stream design to include additional modalities such as depth or thermal data would be a direct next step for further robustness.
Real-time applications that must operate across varying illumination would benefit from the event stream's low-latency motion cues once the alignment cost is paid at training time.

Load-bearing premise

The graph-driven pipeline reliably turns synchronized RGB-Event streams into verifiable scene graphs and high-quality synthetic captions or QA pairs without adding significant noise or bias.
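
To make the premise concrete, here is a minimal sketch of how captions and QA pairs could be read off a scene graph with fixed templates, so that every generated statement traces back to a graph edge. The `SceneGraph` schema, the helper names, and the templates are all hypothetical; the paper's actual synthesis step is not described in this review and may work quite differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    """Hypothetical schema: attributed nodes plus (subject, relation, object) edges."""
    nodes: Dict[str, List[str]] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)


def describe(graph: SceneGraph, name: str) -> str:
    """Render a node as 'attr1 attr2 name', e.g. 'white car'."""
    return " ".join(graph.nodes.get(name, []) + [name])


def synthesize_caption(graph: SceneGraph) -> str:
    """Emit one clause per edge and join the clauses into a single sentence."""
    clauses = [
        f"a {describe(graph, s)} is {rel} a {describe(graph, o)}"
        for s, rel, o in graph.edges
    ]
    if not clauses:
        return ""
    text = "; ".join(clauses)
    return text[0].upper() + text[1:] + "."


def synthesize_qa(graph: SceneGraph) -> List[Tuple[str, str]]:
    """Emit one question per edge; the answer is read directly off the graph."""
    return [
        (f"What is the {describe(graph, s)} {rel}?", f"a {describe(graph, o)}")
        for s, rel, o in graph.edges
    ]


graph = SceneGraph(
    nodes={"car": ["white"], "truck": [], "pedestrian": []},
    edges=[("car", "behind", "truck"), ("pedestrian", "next to", "car")],
)
caption = synthesize_caption(graph)
qa_pairs = synthesize_qa(graph)
print(caption)   # A white car is behind a truck; a pedestrian is next to a white car.
print(qa_pairs)
```

In a template scheme like this, any error in the graph propagates directly into the supervision, which is why the quality of the graph construction step carries so much weight.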

What would settle it

A controlled test in which the same model architecture trained on the synthesized captions and QA pairs performs no better than an identical model trained on randomly generated or heavily noisy labels, when both are evaluated on held-out challenging scenes.

Figures

Figures reproduced from arXiv: 2605.19329 by Chuang Zhu, Donghong Jiang, Endian Lin, Hanqing Liu, Luoping Cui, Mingjie Liu.

**Figure 2.** Construction of RE-VLM: data generation pipeline and model. Left: A graph-driven pipeline converts synchronized RGB frames and event streams into a graph, extracts verifiable scene facts, and synthesizes reliable caption and QA supervision. Center: Representative examples from the datasets yielded by the pipeline: PEOD-Chat (illumination-challenged scenes) and RGBE-Chat (general scenarios). Right: The RE-V…

**Figure 3.** Data generation pipeline overview. From reconstructed event frames and RGB images, two modality-specific graphs are constructed. A degradation-aware fusion then merges them into a single RGB-event graph (nodes: entities, edges: relations). Finally, captions and VQA items are synthesized from the fused graph. (S: subject, P: place, D: direction, T: target; H: hierarchical relation; A: attribute.) attributes…

**Figure 4.** RE-VLM model architecture. Synchronized RGB and event streams are encoded. During…

**Figure 5.** Training pipeline. Three compact stages: (1) initial event-language alignment, (2) alignment of the event and RGB modalities with STAM, (3) end-to-end instruction tuning. We adopt a concise three-stage curriculum that first aligns event representations with language, then aligns them with the RGB representation via STAM, and finally performs lightweight instruction tuning on the LLM. Stage 1: Event-Language alignm…

**Figure 6.** Qualitative VQA comparison in an overexposed traffic…
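
The Figure 3 caption mentions a degradation-aware fusion of two modality-specific graphs without giving the rule. One possible reading, sketched below under loud assumptions, is a confidence blend: the more degraded the RGB frame, the more the event graph's node confidences count. The function name, the scalar `rgb_degradation` score, the linear blend, and the threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def fuse_graphs(rgb_nodes: Dict[str, float], event_nodes: Dict[str, float],
                rgb_degradation: float, keep_threshold: float = 0.5) -> Dict[str, float]:
    """Blend per-node confidences from two graphs (assumed rule, not the paper's).

    rgb_degradation is a hypothetical score in [0, 1]: 0 means a clean RGB frame,
    1 means the RGB frame is unusable and only the event graph is trusted.
    """
    if not 0.0 <= rgb_degradation <= 1.0:
        raise ValueError("rgb_degradation must lie in [0, 1]")
    fused = {}
    for label in sorted(set(rgb_nodes) | set(event_nodes)):
        rgb_conf = rgb_nodes.get(label, 0.0)      # missing node counts as 0 confidence
        event_conf = event_nodes.get(label, 0.0)
        score = (1.0 - rgb_degradation) * rgb_conf + rgb_degradation * event_conf
        if score >= keep_threshold:
            fused[label] = score
    return fused


rgb_nodes = {"car": 0.9, "sign": 0.8}
event_nodes = {"car": 0.7, "pedestrian": 0.9}

clean = fuse_graphs(rgb_nodes, event_nodes, rgb_degradation=0.0)
overexposed = fuse_graphs(rgb_nodes, event_nodes, rgb_degradation=1.0)
print(sorted(clean))        # ['car', 'sign']
print(sorted(overexposed))  # ['car', 'pedestrian']
```

At the two extremes the blend reduces to trusting a single modality, which matches the intuition that events should take over when RGB fails; intermediate values trade the two off.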

Abstract

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments.
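
For readers new to event cameras, the sketch below shows one common way to turn an asynchronous event stream into a dense array: summing signed polarities per pixel over a time window. The `(x, y, t, polarity)` tuple layout and the function name are assumptions for illustration; this simple accumulation is not the reconstruction method used by the authors, which is not specified here.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp_seconds, polarity), polarity is +1 or -1.
Event = Tuple[int, int, float, int]


def accumulate_events(events: List[Event], width: int, height: int,
                      t_start: float, t_end: float) -> List[List[int]]:
    """Sum signed polarities per pixel for events with t_start <= t < t_end."""
    frame = [[0 for _ in range(width)] for _ in range(height)]
    for x, y, t, polarity in events:
        in_window = t_start <= t < t_end
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_window and in_bounds:
            frame[y][x] += polarity
    return frame


events = [
    (0, 0, 0.001, +1),   # brightness increase at pixel (0, 0)
    (0, 0, 0.002, +1),   # second increase at the same pixel
    (2, 1, 0.003, -1),   # brightness decrease at pixel (2, 1)
    (1, 1, 0.020, +1),   # falls outside the 10 ms window below
]
frame = accumulate_events(events, width=3, height=2, t_start=0.0, t_end=0.010)
print(frame)  # [[2, 0, 0], [0, 0, -1]]
```

Because each pixel reports changes independently, static regions stay at zero while moving or flickering regions accumulate counts, which is the motion cue the abstract refers to.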

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RE-VLM proposes a dual-stream RGB-event VLM with a graph-based synthetic data pipeline and two new datasets, but the abstract supplies no metrics or validation of the generated supervision.

The paper's core move is to add event streams to a VLM through parallel encoders and a progressive alignment step, then fill the missing RGB-event-text data with scene graphs that generate captions and QA pairs. They release PEOD-Chat for illumination problems and RGBE-Chat for broader coverage, and claim better captioning and VQA results than RGB-only or event-only baselines, especially in hard conditions. That direction makes sense for robotics or autonomous systems where standard cameras drop out in low light or fast motion.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes RE-VLM, the first dual-stream vision-language model that jointly processes RGB images and event camera streams for robust scene understanding under both normal and adverse conditions (low light, high dynamic range, fast motion). It uses parallel RGB and event encoders with a progressive training strategy to align heterogeneous visual features to language. To mitigate the lack of RGB-Event-Text data, the authors introduce a graph-driven pipeline that converts synchronized streams into scene graphs from which captions and QA pairs are synthesized, yielding the PEOD-Chat and RGBE-Chat datasets. Experiments on captioning and VQA benchmarks report consistent outperformance over RGB-only and event-only baselines with comparable parameter counts, with larger gains under challenging conditions.

Significance. If the empirical claims are substantiated, the work would constitute a useful extension of vision-language models to event-based sensing, exploiting the complementary strengths of event cameras in temporal resolution and dynamic range. The graph-driven synthetic supervision pipeline offers a practical method for generating training data in a new multimodal setting. The emphasis on comparable parameter counts and gains in adverse conditions suggests potential applicability to real-world robust perception tasks.

major comments (1)

[Abstract and Data Synthesis Pipeline] The assertion that the graph-driven pipeline produces 'verifiable scene graphs' and 'high-quality synthetic captions/QA pairs' that supply effective supervision is not accompanied by any quantitative validation (e.g., human agreement rates, noise or bias metrics, or an ablation comparing performance on synthetic versus real data). Because the central outperformance claim, especially under challenging conditions, rests on the quality of this synthetic supervision, the absence of such checks is load-bearing and must be addressed.

minor comments (1)

[Abstract] The statement of 'consistent outperformance' and 'particularly large gains' is presented without any numerical metrics, baseline identifiers, or statistical details, which reduces the standalone informativeness of the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate the suggested validations into the revised version to strengthen the claims regarding our data synthesis pipeline.

Referee: Abstract / Data Synthesis Pipeline: the assertion that the graph-driven pipeline produces 'verifiable scene graphs' and 'high-quality synthetic captions/QA pairs' that supply effective supervision is not accompanied by any quantitative validation (e.g., human agreement rates, noise or bias metrics, or ablation comparing performance on synthetic versus real data). Because the central outperformance claim, especially under challenging conditions, rests on the quality of this synthetic supervision, the absence of such checks is load-bearing and must be addressed.

Authors: We agree that the initial manuscript would benefit from explicit quantitative validation of the graph-driven pipeline. While the scene graphs are constructed using established object detection and relation extraction techniques applied to synchronized RGB-Event streams (ensuring verifiability by construction from the input modalities), we did not report human agreement rates, noise/bias metrics, or a direct ablation of synthetic versus real supervision in the submitted version. In the revision, we will add these elements: (1) human evaluation results on a sampled subset of generated scene graphs, captions, and QA pairs with inter-annotator agreement scores; (2) quantitative metrics assessing noise and potential biases in the synthetic data; and (3) an ablation study comparing RE-VLM performance when trained with the synthetic supervision versus any available real RGB-Event-Text pairs. These additions will directly substantiate the quality of the supervision and its contribution to the observed gains, particularly under challenging conditions. revision: yes
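
The rebuttal promises inter-annotator agreement scores but does not name a statistic. As one standard option, the sketch below computes Cohen's kappa for two annotators labelling the same items, correcting raw agreement for the agreement expected by chance. The choice of kappa, the function name, and the toy labels are assumptions for illustration, not something the authors have committed to.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: two annotators judge four synthetic QA pairs as "ok" or "bad".
annotator_a = ["ok", "ok", "bad", "ok"]
annotator_b = ["ok", "bad", "bad", "ok"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.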

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with independent experimental claims

The paper proposes a dual-stream VLM architecture and a graph-driven pipeline for synthesizing captions/QA pairs from RGB-Event streams, then evaluates performance on constructed datasets via direct comparison to RGB-only and event-only baselines. No mathematical derivations, equations, or fitted parameters are described in the abstract or provided text that could reduce a claimed result to its own inputs by construction. Claims of outperformance rest on empirical benchmarks rather than self-referential definitions, self-citation chains, or renamed known results. The absence of quantitative validation for synthetic data quality is a methodological limitation but does not constitute circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about event-camera complementarity and the fidelity of the graph-based data generation process.

pith-pipeline@v0.9.0 · 5782 in / 1109 out tokens · 41207 ms · 2026-05-22T10:13:08.004524+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

DDD17: End-To-End DAVIS Driving Dataset

Jonathan Binas, Daniel Neil, Shih-Chii Liu, and Tobi Del- bruck. Ddd17: End-to-end davis driving dataset.arXiv preprint arXiv:1711.01458, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

M3ed: Multi-robot, multi-sensor, multi-environment event dataset

Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M Ani Hsieh, Christopher Korpela, Vijay Kumar, Camillo J Taylor, and Kostas Daniilidis. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4016–4023, 2023. 5

work page 2023

[4] [4]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 1, 2, 7

work page 2024

[5] [5]

Segment any event streams via weighted adaptation of pivotal tokens

Zhiwen Chen, Zhiyu Zhu, Yifan Zhang, Junhui Hou, Guang- ming Shi, and Jinjian Wu. Segment any event streams via weighted adaptation of pivotal tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3890–3900, 2024. 5

work page 2024

[6] [6]

Peod: A pixel-aligned event-rgb benchmark for object detection under challenging conditions, 2025

Luoping Cui, Hanqing Liu, Mingjie Liu, Endian Lin, Donghong Jiang, Yuhao Wang, and Chuang Zhu. Peod: A pixel-aligned event-rgb benchmark for object detection under challenging conditions, 2025. 4, 5, 7

work page 2025

[7] [7]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5

work page 2009

[8] [8]

Standard and event cameras fusion for feature tracking

Yan Dong and Tao Zhang. Standard and event cameras fusion for feature tracking. InProceedings of the 2021 International Conference on Machine Vision and Applications, pages 55–60,

work page 2021

[9] [9]

Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

Guillermo Gallego, Tobi Delbr¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020. 1, 2

work page 2020

[10] [10]

Low-latency automo- tive vision with event cameras.Nature, 629(8014):1034–1040,

Daniel Gehrig and Davide Scaramuzza. Low-latency automo- tive vision with event cameras.Nature, 629(8014):1034–1040,

work page

[11] [11]

Asynchronous, photometric feature tracking using events and frames

Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Davide Scaramuzza. Asynchronous, photometric feature tracking using events and frames. InProceedings of the European Conference on Computer Vision (ECCV), pages 750–765,

work page

[12] [12]

Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947– 4954, 2021

Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947– 4954, 2021. 5

work page 2021

[13] [13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14]

Hanme Kim, Stefan Leutenegger, and Andrew J Davison. Real-time 3d reconstruction and 6-dof tracking with an event camera. In European conference on computer vision, pages 349–364. Springer, 2016. 2

[15]

Byounghwa Lee, Hwa Jeon Song, Young-Jin Park, and Byung Ok Kang. Multimodal alzheimer’s disease recognition from image, text and audio. Scientific Reports, 15(1):29038,

[16]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 2, 7

[17]

Haoyue Liu, Shihan Peng, Lin Zhu, Yi Chang, Hanyu Zhou, and Luxin Yan. Seeing motion at nighttime with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25648–25658,

[18]

Mingjie Liu, Hanqing Liu, and Chuang Zhu. Beyond rgb and events: Enhancing object detection under adverse lighting with monocular normal maps. arXiv preprint arXiv:2508.02127, 2025. 2

[19]

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Eventgpt: Event stream understanding with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29139–29149, 2025. 2, 4, 7

[20]

Zhanwen Liu, Nan Yang, Yang Wang, Yuke Li, Xiangmo Zhao, and Fei-Yue Wang. Enhancing traffic object detection in variable illumination with rgb-event fusion. IEEE Transactions on Intelligent Transportation Systems, 2024. 3

[21]

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024. 1, 2, 7

[22]

Tiange Luo, Justin Johnson, and Honglak Lee. View selection for 3d captioning via diffusion ranking. In European Conference on Computer Vision, pages 180–197. Springer, 2024. 4

[23]

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 7

[24]

Elias Mueggler, Chiara Bartolozzi, and Davide Scaramuzza. Fast event-based corner detection. 2017. 2

[25]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2

[26]

Henri Rebecq, Guillermo Gallego, Elias Mueggler, and Davide Scaramuzza. Emvs: Event-based multi-view stereo—3d reconstruction with an event camera in real-time. International Journal of Computer Vision, 126(12):1394–1414, 2018. 2

[27]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

[28]

Antoni Rosinol Vidal, Henri Rebecq, Timo Horstschaefer, and Davide Scaramuzza. Ultimate slam? combining events, images, and imu for robust visual slam in hdr and high-speed scenarios. IEEE Robotics and Automation Letters, 3(2):994–1001, 2018. 2

[29]

Ziyi Wu, Xudong Liu, and Igor Gilitschenski. Eventclip: Adapting clip for event-based object recognition. arXiv preprint arXiv:2306.06354, 2023. 2

[30]

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025. 8

[31]

Yan Yang, Liyuan Pan, Dongxu Li, and Liu Liu. Ezsr: Event-based zero-shot recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4628–4638,

[32]

Jiqing Zhang, Yuanchen Wang, Wenxi Liu, Meng Li, Jinpeng Bai, Baocai Yin, and Xin Yang. Frame-event alignment and fusion network for high frame rate tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9781–9790, 2023. 3

[33]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023. 8

[34]

Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. Eventbind: Learning a unified representation to bind them all for event-based open-world understanding. In European Conference on Computer Vision, pages 477–494. Springer,

[35]

Zhuyun Zhou, Zongwei Wu, Rémi Boutteau, Fan Yang, Cédric Demonceaux, and Dominique Ginhac. Rgb-event fusion for moving object detection in autonomous driving. arXiv preprint arXiv:2209.08323, 2022. 2, 3

[36]

Alex Zihao Zhu, Dinesh Thakur, Tolga Özaslan, Bernd Pfrommer, Vijay Kumar, and Kostas Daniilidis. The multivehicle stereo event camera dataset: An event camera dataset for 3d perception. IEEE Robotics and Automation Letters, 3(3):2032–2039, 2018. 5