pith. machine review for the scientific record.

arxiv: 2604.08457 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.RO

Recognition: 2 theorem links


CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Bin Ran, Junyi Ma, Kai Chen, Pei Li, Rui Gan, Sikai Chen, Xingyou Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords traffic crash understanding · vision-language models · video benchmark · infrastructure-centric perception · causal reasoning · temporal reasoning · cooperative autonomous driving · scene understanding
0 comments

The pith

A roadside-camera benchmark of 250 crash videos shows vision-language models describe scenes but fail at temporal and causal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrashSight as a new benchmark built from real roadside camera footage of traffic crashes. It supplies 13,000 multiple-choice questions split into two tiers: the first checks whether models correctly identify the visible scene and participants, while the second asks about crash mechanics, causes, timing, and results after the event. When eight current vision-language models are tested, they perform adequately on description tasks yet consistently falter on questions that require connecting events across time or determining why a crash occurred. This gap matters for cooperative autonomous driving, where infrastructure cameras must supply reliable understanding of safety-critical moments that ego-vehicle views alone cannot provide.

Core claim

CrashSight supplies 250 real-world roadside crash videos together with 13K question-answer pairs under a two-tier taxonomy. Tier 1 measures visual grounding of scene context and involved parties. Tier 2 measures higher-level reasoning that includes crash mechanics, causal attribution, temporal progression, and post-crash outcomes. Evaluation of eight state-of-the-art vision-language models on the benchmark shows strong scene-description performance alongside clear weakness in temporal and causal reasoning within safety-critical traffic scenarios.
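
To make the evaluation protocol concrete, the sketch below scores a model's multiple-choice answers per tier. It is a minimal illustration, not the authors' released code: the field names (tier, video, question, options, answer) and the query_vlm callable are assumptions about how such a benchmark could be wired up.

```python
# Minimal sketch of tier-wise multiple-choice scoring for a CrashSight-style
# benchmark. Field names and the query_vlm callable are hypothetical; the
# paper's released evaluation code may differ.
from collections import defaultdict

def evaluate_by_tier(questions, query_vlm):
    """questions: iterable of dicts with keys
    'video', 'question', 'options', 'answer', and 'tier' ('tier1' or 'tier2')."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        # query_vlm is assumed to return one of the option labels
        pred = query_vlm(q["video"], q["question"], q["options"])
        total[q["tier"]] += 1
        correct[q["tier"]] += int(pred == q["answer"])
    return {tier: correct[tier] / total[tier] for tier in total}

# The paper's headline finding is a gap between Tier 1 (visual grounding)
# and Tier 2 (mechanics, causality, timing, outcomes) accuracy, e.g.:
# scores = evaluate_by_tier(crashsight_questions, my_vlm_query_fn)
# print(scores)  # hypothetical output: {'tier1': 0.80, 'tier2': 0.55}
```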

What carries the argument

The two-tier taxonomy of questions applied to 250 infrastructure-view crash videos, with Tier 1 testing visual grounding and Tier 2 testing reasoning over mechanics, causation, time, and outcomes.

If this is right

  • Vision-language models need explicit mechanisms to track event sequences and infer causes in dynamic roadside scenes.
  • Infrastructure camera data can fill gaps left by ego-vehicle views for complete traffic crash understanding.
  • The benchmark supplies a repeatable way to measure progress toward reliable VLM use in cooperative driving systems.
  • Failure patterns identified in the benchmark point to concrete weaknesses in handling phase-aware crash progression.
  • A standardized infrastructure-centric evaluation supports targeted improvements in perception for autonomous vehicles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar two-tier benchmarks could expose reasoning shortfalls in other time-critical domains such as medical event monitoring or industrial safety.
  • Training pipelines for vision-language models would benefit from larger volumes of explicitly annotated temporal and causal examples drawn from real incidents.
  • Pairing CrashSight-style tests with controlled simulations could isolate whether weaknesses stem from data scarcity or architectural limits.
  • Widespread adoption of infrastructure-assisted benchmarks may speed integration of roadside perception into smart-city traffic systems.

Load-bearing premise

The 250 videos and their annotations under the two-tier taxonomy accurately represent real-world crash scenarios and test genuine reasoning capabilities rather than superficial pattern matching.

What would settle it

A model that scores high on both tiers of the CrashSight questions, especially the temporal and causal items, while retaining strong results on existing general benchmarks would directly contradict the reported performance gap.
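
As a hedged illustration of that falsification test, the criterion can be expressed as a simple check over per-tier accuracies; the threshold values below are placeholders, not numbers taken from the paper.

```python
# Illustrative falsification check: a model clearing both CrashSight tiers
# while staying strong on a general video benchmark would contradict the
# reported gap. Thresholds are arbitrary placeholders, not from the paper.
def contradicts_reported_gap(tier_scores, general_score,
                             tier_floor=0.85, general_floor=0.80):
    """tier_scores: {'tier1': acc, 'tier2': acc}; accuracies in [0, 1]."""
    return (tier_scores.get("tier1", 0.0) >= tier_floor
            and tier_scores.get("tier2", 0.0) >= tier_floor
            and general_score >= general_floor)
```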

Figures

Figures reproduced from arXiv: 2604.08457 by Bin Ran, Junyi Ma, Kai Chen, Pei Li, Rui Gan, Sikai Chen, Xingyou Yang.

Figure 1: Overview of CrashSight-VQA. (a) Phase-aware temporal structure of a crash video. (b) VLM performance comparison across 7 … [image not reproduced here]
Figure 2: Overview of the CrashSight benchmark curation pipeline. Surveillance videos are processed through a three-stage annotation … [image not reproduced here]
Figure 3: QA taxonomy of CrashSight. Seven categories are … [image not reproduced here]
Figure 4: Dataset statistics of CrashSight-VQA. (a) Ground-truth … [image not reproduced here]
read the original abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CrashSight, a benchmark of 250 real-world roadside crash videos from infrastructure cameras, annotated with 13K multiple-choice QA pairs under a two-tier taxonomy. Tier 1 targets visual grounding of scene context and involved parties; Tier 2 targets higher-level reasoning on crash mechanics, causal attribution, temporal progression, and post-crash outcomes. The authors evaluate eight state-of-the-art VLMs and report that models perform well on scene description but struggle on temporal and causal reasoning in safety-critical scenarios. The work positions the benchmark as a standardized framework for infrastructure-assisted perception in cooperative autonomous driving and releases the dataset and code.

Significance. If the Tier 2 questions genuinely isolate causal and temporal reasoning rather than permitting static-frame or language-prior shortcuts, the reported performance gaps would provide actionable guidance for improving VLMs in safety-critical traffic understanding. The infrastructure-centric focus fills a gap left by ego-vehicle benchmarks, and the public release of the 250-video corpus with 13K QA pairs constitutes a concrete resource for the community.

major comments (2)
  1. [Dataset Construction / Experiments] Dataset Construction and Experiments sections: The central claim that current VLMs 'struggle with temporal and causal reasoning' rests on Tier 2 questions genuinely requiring those faculties. The manuscript provides no human performance baselines, inter-annotator agreement statistics, or ablation results (e.g., single-frame vs. full-video inputs) to rule out shortcut solutions based on static cues or language priors. Without these controls, the observed score gap could reflect annotation artifacts rather than a reasoning deficit.
  2. [Abstract / Dataset Construction] Abstract and Section 3 (or equivalent): Video selection criteria, annotation methodology, and quality-control procedures for the 250 videos and 13K QA pairs are not described in sufficient detail to assess whether the corpus accurately represents real-world crash diversity or whether Tier 2 items were validated for reasoning depth.
minor comments (2)
  1. [Abstract] The abstract states the dataset size and taxonomy but does not report quantitative VLM results or failure-mode statistics; these should be summarized with at least one key table reference for readers who stop at the abstract.
  2. [Experiments] Ensure that all model names, exact prompt templates, and evaluation metrics (accuracy, etc.) are explicitly listed in the experimental setup rather than left to supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of validating the benchmark's claims regarding VLM limitations in temporal and causal reasoning. We address each major comment below and have revised the manuscript to incorporate additional controls and details.

read point-by-point responses
  1. Referee: Dataset Construction / Experiments sections: The central claim that current VLMs 'struggle with temporal and causal reasoning' rests on Tier 2 questions genuinely requiring those faculties. The manuscript provides no human performance baselines, inter-annotator agreement statistics, or ablation results (e.g., single-frame vs. full-video inputs) to rule out shortcut solutions based on static cues or language priors. Without these controls, the observed score gap could reflect annotation artifacts rather than a reasoning deficit.

    Authors: We agree that these controls are necessary to substantiate the claim that performance gaps reflect reasoning deficits rather than artifacts. In the revised manuscript, we have added: (i) human performance baselines from multiple expert annotators on a stratified subset of Tier 2 questions; (ii) inter-annotator agreement statistics (Cohen's kappa) computed during QA pair validation; and (iii) an ablation study evaluating all eight VLMs on single-frame (middle frame) versus full-video inputs. The results show that full-video inputs yield only marginal gains on Tier 1 but substantial improvements on Tier 2 for models with video capabilities, supporting that the gaps are driven by temporal/causal demands rather than static cues or language priors alone. revision: yes

  2. Referee: Abstract / Dataset Construction: Video selection criteria, annotation methodology, and quality-control procedures for the 250 videos and 13K QA pairs are not described in sufficient detail to assess whether the corpus accurately represents real-world crash diversity or whether Tier 2 items were validated for reasoning depth.

    Authors: We acknowledge that the original description was insufficiently detailed. We have expanded Section 3 with a dedicated subsection on dataset construction that now specifies: video selection criteria (diversity across crash severity, vehicle types, weather, lighting, and geographic locations drawn from multiple infrastructure sources); annotation methodology (two-stage process with initial question generation by domain experts followed by multiple-choice option creation to minimize language priors, with explicit targeting of causal/temporal elements in Tier 2); and quality-control procedures (independent review by three annotators per item, adjudication of disagreements, and post-hoc validation that Tier 2 questions cannot be solved from a single frame or generic priors via pilot testing). These additions allow readers to evaluate the corpus's representativeness and the reasoning depth of the questions. revision: yes
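
For readers checking the agreement statistic cited in the first response above, the sketch below computes Cohen's kappa for two annotators' multiple-choice labels. It is the generic textbook formula, not the authors' validation pipeline.

```python
# Cohen's kappa for two annotators labeling the same multiple-choice items.
# Generic formula shown for illustration; not the authors' validation code.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)

# Example over answer options A-D chosen by two annotators:
# cohens_kappa(["A", "B", "C", "A"], ["A", "B", "D", "A"])  # ≈ 0.64
```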

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent dataset and model evaluations

full rationale

The paper creates a new 250-video dataset with 13K QA pairs under an explicit two-tier taxonomy, then reports direct VLM performance numbers on it. No equations, fitted parameters, predictions derived from the same data, or self-citation chains appear in the provided text. Tier 1/Tier 2 distinctions and failure analyses are presented as empirical observations, not as derivations that reduce to the inputs by construction. The central claim (strong scene description but weak temporal/causal reasoning) rests on external model testing against the released benchmark, which is falsifiable and not self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests primarily on the creation of a new annotated video dataset and custom evaluation taxonomy; the main unverified premise is that the chosen videos and questions validly measure the targeted reasoning skills.

axioms (1)
  • domain assumption Roadside camera footage contains sufficient visual detail to support reliable human annotation of crash mechanics, causality, and temporal progression.
    Invoked implicitly when constructing the Tier 2 questions on crash mechanics and causal attribution.

pith-pipeline@v0.9.0 · 5534 in / 1334 out tokens · 60490 ms · 2026-05-10T17:06:19.443779+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. An Agentic Workflow for Detecting Personally Identifiable Information in Crash Narratives

    cs.CR · 2026-04 · unverdicted · novelty 6.0

    A hybrid agentic workflow using Presidio for structured PII and fine-tuned LLMs plus verification for names, addresses, and identifiers detects PII in crash narratives at 0.82 precision and 0.94 recall.

Reference graph

Works this paper leans on

37 extracted references · 15 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

    Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024.

  2. [2]

    PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025.

  3. [3]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized language models. arXiv preprint arXiv:2305.14314, 2023. Accepted to NeurIPS 2023.

  4. [4]

    Trafficvlm: A controllable visual language model for traffic video captioning

    Quang Minh Dinh, Minh Khoi Ho, Anh Quan Dang, and Hung Phong Tran. Trafficvlm: A controllable visual language model for traffic video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7134–7143, 2024.

  5. [5]

    Dada-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark

    Jianwu Fang, Dingxin Yan, Jiahuan Qiao, Jianru Xue, He Wang, and Sen Li. Dada-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 4303–4309. IEEE, 2019.

  6. [6]

    Abductive ego-view accident video understanding for safe driving perception

    Jianwu Fang, Lei-lei Li, Junfei Zhou, Junbin Xiao, Hongkai Yu, Chen Lv, Jianru Xue, and Tat-Seng Chua. Abductive ego-view accident video understanding for safe driving perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22030–22040, 2024.

  7. [7]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Zhao Peiyuan, Jia Hao, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  8. [8]

    Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

    Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3302–3310, 2025.

  9. [9]

    Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding

    Younggun Kim, Ahmed S Abdelrahman, and Mohamed Abdel-Aty. Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 761–771,

  10. [10]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proc. ACM SIGOPS Symp. Oper. Syst. Principles (SOSP), 2023. vLLM.

  11. [11]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  12. [12]

    Anomaly detection and localization in crowded scenes

    Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2013.

  13. [13]

    Improving LLM video understanding with 16 frames per second

    Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Improving LLM video understanding with 16 frames per second. arXiv preprint arXiv:2503.13956, 2025.

  14. [14]

    Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models

    Bo Liu, Pengfei Qiao, Minhan Ma, Xuange Zhang, Yinan Tang, Peng Xu, Kun Liu, and Tongtong Yuan. Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models. arXiv preprint arXiv:2505.12589, 2025.

  15. [15]

    Future frame prediction for anomaly detection–a new baseline

    Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.

  16. [16]

    Abnormal event detection at 150 fps in matlab

    Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727,

  17. [17]

    A simulation-based framework for urban traffic accident detection

    Haohan Luo and Feng Wang. A simulation-based framework for urban traffic accident detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

  18. [18]

    Conan: Progressive learning to reason like a detective over multi-scale visual evidence

    Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Conan: Progressive learning to reason like a detective over multi-scale visual evidence. arXiv preprint arXiv:2510.20470, 2025.

  19. [19]

    Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives

    Chirag Parikh, Deepti Rawat, Tathagata Ghosh, Ravi Kiran Sarvadevabhatla, et al. Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19002–19011,

  20. [20]

    Cot-vlm4tar: Chain-of-thought guided vision-language models for traffic anomaly resolution

    Tianchi Ren, Haibo Hu, Jiacheng Zuo, Xinhong Chen, Jianping Wang, Chun Jason Xue, Jen-Ming Wu, and Nan Guan. Cot-vlm4tar: Chain-of-thought guided vision-language models for traffic anomaly resolution. arXiv preprint arXiv:2503.01632, 2025.

  21. [21]

    Safeplug: Empowering multimodal llms with pixel-level insight and temporal grounding for traffic accident understanding

    Zihao Sheng, Zilin Huang, Yansong Qu, Jiancong Chen, Yuhao Luo, Yen-Jung Chen, Yue Leng, and Sikai Chen. Safeplug: Empowering multimodal llms with pixel-level insight and temporal grounding for traffic accident understanding. arXiv preprint arXiv:2508.06763, 2025.

  22. [22]

    Real-world anomaly detection in surveillance videos

    Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.

  23. [23]

    Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4567–4576, 2025.

  24. [24]

    Interact-video: Reasoning-rich video qa for urban traffic

    Joseph Raj Vishal, Divesh Basina, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, and Bharatesh Chakravarthi. Interact-video: Reasoning-rich video qa for urban traffic. arXiv preprint arXiv:2507.14743, 2025.

  25. [25]

    Qwen2.5-VL Technical Report

    Peng Wang, Shuai Bai, Sinan Gao, Jialin Wang Garcia, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  26. [26]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  27. [27]

    DriveQA: Passing the driving knowledge test

    Maolin Wei, Wanzhou Liu, and Eshed Ohn-Bar. DriveQA: Passing the driving knowledge test. arXiv preprint arXiv:2508.21824, 2025. Accepted to ICCV 2025.

  28. [28]

    Hazardvlm: A video language model for real-time hazard description in automated driving systems

    Dannier Xiao, Mehrdad Dianati, Paul Jennings, and Roger Woodman. Hazardvlm: A video language model for real-time hazard description in automated driving systems. IEEE Transactions on Intelligent Vehicles, 2024.

  29. [29]

    Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events

    Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9878–9888, 2021.

  30. [30]

    TAD: A large-scale benchmark for traffic accidents detection from video surveillance

    Yajun Xu, Chengwei Huang, Yong Nan, and Shiguo Lian. TAD: A large-scale benchmark for traffic accidents detection from video surveillance. IEEE Access, 13:2018–2033, 2025.

  31. [31]

    V2x-vlm: End-to-end v2x cooperative autonomous driving through large vision-language models

    Junwei You, Zhuoyu Jiang, Zilin Huang, Haotian Shi, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, and Bin Ran. V2x-vlm: End-to-end v2x cooperative autonomous driving through large vision-language models. Transportation Research Part C: Emerging Technologies, 183:105457, 2026.

  32. [32]

    Traffic accident benchmark for causality recognition

    Tackgeun You and Bohyung Han. Traffic accident benchmark for causality recognition. In European Conference on Computer Vision, pages 540–556. Springer, 2020.

  33. [33]

    Towards surveillance video-and-language understanding: New dataset baselines and challenges

    Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset baselines and challenges. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22052–22061, 2024.

  34. [34]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8995–9006, 2025.

  35. [35]

    TUMTraffic-VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes

    Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Vennkatnarayanan Lakshminarasimhan, Leah Strand, and Alois C. Knoll. TUMTraffic-VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes. arXiv preprint arXiv:2502.02449, 2025.

  36. [36]

    Tau-106k: A new dataset for comprehensive understanding of traffic accident

    Yixuan Zhou, Long Bai, Sijia Cai, Bing Deng, Xing Xu, and Heng Tao Shen. Tau-106k: A new dataset for comprehensive understanding of traffic accident. In The Thirteenth International Conference on Learning Representations, 2025.

  37. [37]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.