pith. machine review for the scientific record.

arxiv: 2604.09819 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI

Recognition: unknown

ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

Lukas Picek, Marek Hanzl, Michal Čermák, Vojtěch Čermák

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords vehicle accident detection · traffic surveillance videos · CCTV footage · benchmark dataset · temporal localization · spatial localization · collision type classification · zero-shot evaluation

The pith

A new benchmark dataset of real and synthetic CCTV clips evaluates vehicle accident detection models across supervised and zero-shot settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the ACCIDENT dataset to give researchers a shared testbed for finding when and where accidents occur in traffic videos and what kind of collision happened. It mixes 2,027 real clips with 2,211 synthetic ones, each labeled for accident timing, location, and type, so models can be checked both when lots of training data exist and when they do not. Three tasks are defined and scored with custom metrics that tolerate the natural vagueness of surveillance footage. A range of baselines, from simple motion rules to vision-language models, are run on the data to show the problems remain hard. The result is a concrete way to measure progress toward reliable automatic accident detection in everyday camera feeds.
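The paper's custom metrics are not reproduced on this page. Purely as an illustration of what "tolerating natural vagueness" could mean for temporal localization, here is a minimal sketch that credits predictions landing inside an uncertainty window around the annotated accident time; the function name, the linear decay, and the 2-second window are assumptions, not the paper's definitions.

```python
# Hypothetical tolerance-aware temporal localization score. The actual
# ACCIDENT metric is not defined in the text above; this only illustrates
# crediting predictions inside an uncertainty window around ground truth.

def temporal_score(pred_time: float, gt_time: float, tolerance: float = 2.0) -> float:
    """Return 1.0 for an exact hit, decaying linearly to 0.0 at the edge
    of a +/- `tolerance`-second window around the annotated time."""
    error = abs(pred_time - gt_time)
    return max(0.0, 1.0 - error / tolerance)

# A prediction 0.5 s off with a 2 s window scores 0.75.
assert temporal_score(12.5, 12.0) == 0.75
```

A hard hit/miss criterion at a fixed offset would be the other natural design; the paper's wording only tells us that some such slack is built into the scoring.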

Core claim

The paper establishes that the ACCIDENT benchmark, built from 4,238 annotated traffic clips, enables consistent evaluation of accident detection through temporal localization, spatial localization, and collision-type classification, using metrics designed to reflect the uncertainty present in real CCTV recordings, and that existing approaches fall short on these measures in both data-rich and data-scarce regimes.

What carries the argument

The ACCIDENT benchmark dataset of 2,027 real and 2,211 synthetic clips with annotations for accident time, spatial location, and high-level collision type, together with three defined tasks and custom metrics that handle footage ambiguity.

If this is right

  • Models can be compared directly in IID supervised, OOD supervised, and zero-shot regimes using the same data and metrics (a toy sketch of the three regimes follows this list).
  • The mix of real and synthetic clips allows testing how well methods generalize when real accident examples are limited.
  • Custom metrics that tolerate uncertainty set a practical standard for what counts as successful detection in surveillance video.
  • Baseline results establish an initial performance floor that future methods must beat to show improvement.
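To make the three regimes in the first bullet concrete, here is a toy split construction. Everything in it is assumed for illustration: the record fields, the hash-based IID split, and the use of geographic region as the OOD axis are not taken from the paper, which does not describe its split protocol on this page.

```python
# Toy illustration of the three evaluation regimes named above. Field
# names and the OOD criterion (a held-out region) are assumptions; the
# benchmark's actual split protocol is not described on this page.
from dataclasses import dataclass
from zlib import crc32

@dataclass
class Clip:
    clip_id: str
    source: str   # "real" or "synthetic"
    region: str   # coarse geographic origin, e.g. "USA", "Asia"

def make_splits(clips: list[Clip], held_out_region: str = "Asia"):
    """IID: train/test drawn from one pool; OOD: test on a held-out
    region; zero-shot: every clip is test-only, with no training pool."""
    # Deterministic pseudo-random 80/20 IID split keyed on the clip id.
    iid_train = [c for c in clips if crc32(c.clip_id.encode()) % 5 != 0]
    iid_test = [c for c in clips if crc32(c.clip_id.encode()) % 5 == 0]
    ood_train = [c for c in clips if c.region != held_out_region]
    ood_test = [c for c in clips if c.region == held_out_region]
    zero_shot_test = list(clips)  # models see no ACCIDENT data beforehand
    return iid_train, iid_test, ood_train, ood_test, zero_shot_test
```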

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread adoption could shift research focus from collecting more data toward handling the specific ambiguities of traffic camera views.
  • The synthetic portion may prove useful for pre-training models that later adapt to real footage with few labels.
  • The three-task structure could be extended to related problems such as near-miss detection or multi-vehicle interaction analysis.

Load-bearing premise

The human-provided labels for accident timing, location, and collision type are accurate and consistent enough to serve as reliable ground truth.

What would settle it

A model that scores near the maximum on all three tasks using the paper's custom metrics would indicate the dataset or metrics do not sufficiently capture real-world ambiguity.

Figures

Figures reproduced from arXiv: 2604.09819 by Lukas Picek, Marek Hanzl, Michal Čermák, and Vojtěch Čermák.

Figure 1. The ACCIDENT dataset reflects the diversity of real-world traffic footage. Available data vary in scene type (e.g., intersection, highway), camera viewpoint, weather conditions (e.g., rain, snow), video quality (e.g., resolution, compression), and collision type (e.g., head-on, rear-end, side-swipe), and originate from various regions across the globe (e.g., USA, Russia, the Middle East, Asia).
Figure 2. Diversity and challenge factors in the ACCIDENT dataset. (a) Video quality is biased towards low and medium, reflecting the typical quality for CCTV feeds. (b) Scene layouts are dominated by highways and intersections with multi-agent scenarios, critical for robust model evaluation. (c) Weather distributions include a range of rain and snow conditions, enabling stress testing under adverse environments. (d) …
Figure 3. Synthetic accident scenarios. Examples are generated from static third-person viewpoints that emulate CCTV. Road geometry, lighting, and weather are varied to stress-test detection under diverse conditions, while accident timing and agent behavior are controlled. Top: raw footage. Bottom: annotations showing class-labeled 2D boxes, collision indicator, and semantic segmentation. The clips were generated from …
Figure 4. Zero-shot spatial localization with Molmo-7B. The top row shows representative successful localizations of the accident, while the bottom row shows typical failure cases, often caused by occlusion, poor lighting, low video quality, or ambiguous scenes. Bounding-box heuristic: at impact, the interacting agents are typically the closest pair in the scene, and their detection centers converge even under occlusion. (A toy sketch of this heuristic follows the figure list.)
Figure 5.
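The Figure 4 caption describes a closest-pair bounding-box heuristic: at impact, the two interacting agents are typically the closest pair of detections, with converging box centers. A minimal sketch of that idea follows, assuming axis-aligned boxes from any off-the-shelf detector; the box format, helper names, and the 40-pixel threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the closest-pair heuristic from the Figure 4 caption:
# at impact, the two interacting agents' detection centers are typically
# the closest pair in the scene. Box format (x1, y1, x2, y2) and the
# distance threshold are assumptions, not values from the paper.
import math
from itertools import combinations

Box = tuple[float, float, float, float]  # x1, y1, x2, y2

def center(box: Box) -> tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def closest_pair(boxes: list[Box]) -> tuple[int, int, float] | None:
    """Indices of the two detections whose centers are closest, plus
    that center distance; None if there are fewer than two boxes."""
    if len(boxes) < 2:
        return None
    return min(
        ((i, j, math.dist(center(a), center(b)))
         for (i, a), (j, b) in combinations(enumerate(boxes), 2)),
        key=lambda t: t[2],
    )

# Flag a frame as a candidate impact when the closest pair of vehicle
# detections falls under a (hypothetical) pixel threshold.
boxes = [(10, 10, 50, 50), (45, 12, 85, 52), (300, 300, 340, 340)]
pair = closest_pair(boxes)
if pair is not None and pair[2] < 40:
    print(f"candidate collision between detections {pair[0]} and {pair[1]}")
```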
Original abstract

We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. You can access the ACCIDENT at: https://accidentbench.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the ACCIDENT benchmark dataset for vehicle accident detection in traffic surveillance videos. It consists of 2,027 real and 2,211 synthetic clips annotated with accident timing, spatial location, and collision type. The work defines three tasks—temporal localization, spatial localization, and collision type classification—evaluated under supervised IID, OOD, and zero-shot regimes using custom metrics that aim to handle uncertainty and ambiguity in CCTV footage. A set of baselines spanning heuristic, motion-aware, and vision-language approaches is presented to demonstrate that the benchmark is challenging.

Significance. If the ground-truth annotations prove reliable and the custom metrics are rigorously justified, ACCIDENT could serve as a useful public resource for developing and comparing accident detection models in safety-critical surveillance applications. The combination of real and synthetic data plus multiple evaluation settings (data-rich to data-scarce) addresses a practical gap; the public dataset release supports reproducibility.

major comments (2)
  1. [Dataset Construction and Annotation] The paper provides no details on the annotation protocol, including the number of annotators, their qualifications, inter-annotator agreement scores, or procedures for resolving ambiguous cases (e.g., pinpointing accident onset in gradual collisions or spatial extent in low-visibility footage). Because the central claim is that ACCIDENT constitutes a reliable benchmark whose baseline gaps reflect genuine task difficulty, the absence of this information prevents verification that the reported labels are accurate and consistent.
  2. [Evaluation Metrics and Baselines] The custom metrics intended to account for uncertainty are mentioned but never formally defined (no equations, pseudocode, or edge-case handling). Baseline descriptions remain high-level without implementation details, training protocols, or error analysis. This makes it impossible to determine whether performance differences arise from dataset properties or from metric and baseline design choices.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including one or two quantitative baseline results to immediately convey the claimed difficulty of the benchmark.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and reproducibility of our benchmark. We address each major comment below and will revise the manuscript to incorporate the requested details.

point-by-point responses
  1. Referee: Dataset Construction and Annotation: The paper provides no details on the annotation protocol, including the number of annotators, their qualifications, inter-annotator agreement scores, or procedures for resolving ambiguous cases (e.g., pinpointing accident onset in gradual collisions or spatial extent in low-visibility footage). Because the central claim is that ACCIDENT constitutes a reliable benchmark whose baseline gaps reflect genuine task difficulty, the absence of this information prevents verification that the reported labels are accurate and consistent.

    Authors: We agree that a detailed account of the annotation process is necessary to support the benchmark's reliability. In the revised manuscript, we will add a dedicated subsection on annotation that specifies the number of annotators, their qualifications and training, inter-annotator agreement scores (including the metric used), and the resolution procedures for ambiguous cases such as gradual collisions or low-visibility footage (a toy example of one common agreement metric follows these responses). Revision: yes.

  2. Referee: Evaluation Metrics and Baselines: The custom metrics intended to account for uncertainty are mentioned but never formally defined (no equations, pseudocode, or edge-case handling). Baseline descriptions remain high-level without implementation details, training protocols, or error analysis. This makes it impossible to determine whether performance differences arise from dataset properties or from metric and baseline design choices.

    Authors: We acknowledge that the custom metrics and baselines require more formal and detailed presentation. In the revised version, we will introduce the metrics with explicit mathematical definitions, pseudocode, and edge-case handling for uncertainty in CCTV footage. We will also expand the baseline descriptions to include implementation details, training protocols, hyperparameters, and an error analysis that clarifies the sources of observed performance gaps. Revision: yes.
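The first response promises inter-annotator agreement scores without naming a metric. For readers unfamiliar with such scores, here is one standard choice, Cohen's kappa, sketched for two annotators labeling collision types; the metric choice and the toy labels are illustrative assumptions, not the authors'.

```python
# Illustrative Cohen's kappa for two annotators assigning collision-type
# labels to the same clips. The authors do not say which agreement metric
# they will report; this is just one common choice.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently at their own marginal rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

ann_a = ["rear-end", "head-on", "side-swipe", "rear-end"]
ann_b = ["rear-end", "side-swipe", "side-swipe", "rear-end"]
print(cohens_kappa(ann_a, ann_b))  # 0.6 on this toy data
```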

Circularity Check

0 steps flagged

No circularity: dataset introduction with no derivations or self-referential reductions

full rationale

The paper presents ACCIDENT as a new benchmark dataset of 2,027 real and 2,211 synthetic CCTV clips annotated for accident timing, spatial location, and collision type. It defines three evaluation tasks (temporal localization, spatial localization, collision classification) and supplies heuristic/motion-aware/vision-language baselines, showing the benchmark is challenging. No equations, fitted parameters, or derivation chains appear in the provided text. Claims rest on dataset curation and custom metrics that explicitly account for CCTV ambiguity; these are presented as contributions rather than reductions to prior self-citations or inputs by construction. This matches the default case of a self-contained dataset paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is the curation and annotation of video clips plus task and metric definitions rather than any new theoretical constructs or fitted parameters.

axioms (1)
  • Domain assumption: Annotations for accident time, location, and collision type are accurate and unambiguous enough for reliable evaluation.
    The benchmark relies on these annotations for all tasks and metrics.

pith-pipeline@v0.9.0 · 5471 in / 1273 out tokens · 40535 ms · 2026-05-10T16:34:41.830202+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    A two-pass pipeline with Qwen3-VL-Plus and Gemini 3.1 Flash-Lite achieves 0.539 accuracy on the ACCIDENT@CVPR 2026 benchmark of 2,027 real CCTV videos for zero-shot temporal-spatial grounding of traffic events.

Reference graph

Works this paper leans on

45 extracted references · 5 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1] Battlefield advanced trauma life support (BATLS). BMJ Military Health, 146(2):110–114, 2000.

  2. [2] Victor Adewopo, Nelly Elsayed, Zag ElSayed, Murat Ozer, Constantinos Zekios, Ahmed Abdelgawad, and Magdy Bayoumi. Traffic accident detection video dataset for AI-driven computer vision systems in smart city transportation, 2023.

  3. [3] Aloukik Aditya, Liudu Zhou, Hrishika Vachhani, Dhivya Chandrasekaran, and Vijay Mago. Collision detection: An improved deep learning approach using SENet and ResNeXt. In 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2021.

  4. [4] Anas Al-Lahham, Muhammad Zaigham Zaheer, Nurbek Tastan, and Karthik Nandakumar. Collaborative learning of anomalies with privacy (CLAP) for unsupervised video anomaly detection: A new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12416–12425, 2024.

  5. [5] Emanuele Alberti, Antonio Tavera, Carlo Masone, and Barbara Caputo. IDDA: A large-scale multi-domain dataset for autonomous driving. IEEE Robotics and Automation Letters, 5(4):5526–5533, 2020.

  6. [6] Sylvain Arlot, Alain Celisse, and Zaid Harchaoui. A kernel multiple change-point algorithm via model selection. Journal of Machine Learning Research, 20(162):1–56, 2019.

  7. [7] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  8. [8] Wentao Bao, Qi Yu, and Yu Kong. Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In ACM Multimedia Conference, 2020.

  9. [9] Wentao Bao, Qi Yu, and Yu Kong. DRIVE: Deep reinforced accident anticipation with visual explanation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7619–7628, 2021.

  10. [10] Alain Celisse, Guillemette Marot, Morgane Pierre-Jean, and GJ Rigaill. New efficient algorithms for multiple change-point detection with reproducing kernels. Computational Statistics & Data Analysis, 128:200–220, 2018.

  11. [11] Yachuang Chai, Jianwu Fang, Haoquan Liang, and Wushouer Silamu. TADS: A novel dataset for road traffic accident detection from a surveillance perspective. The Journal of Supercomputing, 80(18):26226–26249, 2024.

  12. [12] Fu-Hsiang Chan, Yu-Ting Chen, Yu Xiang, and Min Sun. Anticipating accidents in dashcam videos. In Asian Conference on Computer Vision, pages 136–153. Springer, 2016.

  13. [13] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104.

  14. [14] Quang Minh Dinh, Minh Khoi Ho, Anh Quan Dang, and Hung Phong Tran. TrafficVLM: A controllable visual language model for traffic video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7134–7143, 2024.

  15. [15] Jianwu Fang, Dingxin Yan, Jiahuan Qiao, Jianru Xue, He Wang, and Sen Li. DADA-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark, 2019.

  16. [16] Jianwu Fang, Jiahuan Qiao, Jianru Xue, and Zhengguo Li. Vision-based traffic accident detection and anticipation: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):1983–1999, 2023.

  17. [17] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.

  18. [18] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714.

  19. [19] Itseez. Open Source Computer Vision Library. https://github.com/itseez/opencv, 2015.

  20. [20] Glenn Jocher and Jing Qiu. Ultralytics YOLO11, 2024.

  21. [21] Hoon Kim, Kangwook Lee, Gyeongjo Hwang, and Changho Suh. Crash to not crash: Learn to identify dangerous vehicles using a simulator. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):978–985, 2019.

  22. [22] Yiming Li, Dekun Ma, Ziyan An, Zixun Wang, Yiqi Zhong, Siheng Chen, and Chen Feng. V2X-Sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving, 2022.

  23. [23] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection – a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6536–6545, 2018.

  24. [24] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13588–13597, 2021.

  25. [25] Haohan Luo and Feng Wang. A simulation-based framework for urban traffic accident detection. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.

  26. [26] Jefferson Ryan Medel and Andreas Savakis. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390, 2016.

  27. [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  28. [28] World Health Organization. Global status report on road safety 2023: Summary. World Health Organization, 2023.

  29. [29] Lukas Picek, Vojtech Cermak, and Marek Hanzl. Zero-shot hazard identification in autonomous driving: A case study on the COOOL benchmark. In Proceedings of the Winter Conference on Applications of Computer Vision, pages 654–663.

  30. [30] Frederick B Rogers, Katelyn J Rittenhouse, and Brian W Gross. The golden hour in trauma: Dogma or medical folklore? Injury, 46(4):525–527, 2015.

  31. [31] Ankit Shah, Jean Baptiste Lamare, Tuan Nguyen Anh, and Alexander Hauptmann. CADP: A novel dataset for CCTV traffic camera based accident analysis, 2018.

  32. [32] Harpreet Singh, Emily M. Hand, and Kostas Alexis. Anomalous motion detection on highway using deep learning, 2020.

  33. [33] Zhihang Song, Zimin He, Xingyu Li, Qiming Ma, Ruibo Ming, Zhiqi Mao, Huaxin Pei, Lihui Peng, Jianming Hu, Danya Yao, and Yi Zhang. Synthetic datasets for autonomous driving: A survey. IEEE Transactions on Intelligent Vehicles, 9(1):1847–1864, 2024.

  34. [34] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos, 2019.

  35. [35] Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc Van Gool, Bernt Schiele, Federico Tombari, and Fisher Yu. SHIFT: A synthetic driving dataset for continuous multi-task domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21371–21382, 2022.

  36. [36] Charles Truong, Laurent Oudre, and Nicolas Vayatis. Selective review of offline change point detection methods. Signal Processing, 167:107299, 2020.

  37. [37] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  38. [38] Thakare Kamalakar Vijay, Debi Prosad Dogra, Heeseung Choi, Gipyo Nam, and Ig-Jae Kim. Detection of road accidents using synthetically generated multi-perspective accident videos. IEEE Transactions on Intelligent Transportation Systems, 24(2):1926–1935, 2023.

  39. [39] Tianqi Wang, Sukmin Kim, Ji Wenxuan, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, and Ping Luo. DeepAccident: A motion and accident prediction benchmark for V2X autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5599–5606, 2024.

  40. [40] Runsheng Xu, Hao Xiang, Xin Xia, Xu Han, Jinlong Li, and Jiaqi Ma. OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In 2022 International Conference on Robotics and Automation (ICRA), pages 2583–2589. IEEE, 2022.

  41. [41] Yajun Xu, Huan Hu, Chuwen Huang, Yibing Nan, Yuyao Liu, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. TAD: A large-scale benchmark for traffic accidents detection from video surveillance. IEEE Access, 13:2018–2033, 2025.

  42. [42] Yu Yao, Mingze Xu, Yuchen Wang, David J. Crandall, and Ella M. Atkins. Unsupervised traffic accident detection in first-person videos, 2019.

  43. [43] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

  44. [44] Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, and Kaan Ozbay. When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis. Accident Analysis & Prevention, 219:108077, 2025.

  45. [45] Walter Zimmer, Ross Greer, Xingcheng Zhou, Rui Song, Hu Cao, Daniel Lehmberg, Marc Pavel, Ahmed Alaaeldin Ghita, Akshay Gopalkrishnan, Holger Caesar, et al. Towards Vision Zero: The TUM Traffic Accid3nD dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 841–851, 2025.