pith. machine review for the scientific record.

arxiv: 2604.08457 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.RO

Recognition: 2 theorem links


CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Bin Ran, Junyi Ma, Kai Chen, Pei Li, Rui Gan, Sikai Chen, Xingyou Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords traffic crash understanding · vision-language models · video benchmark · infrastructure-centric perception · causal reasoning · temporal reasoning · cooperative autonomous driving · scene understanding
0 comments

The pith

A roadside-camera benchmark of 250 crash videos shows vision-language models describe scenes but fail at temporal and causal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrashSight as a new benchmark built from real roadside camera footage of traffic crashes. It supplies 13,000 multiple-choice questions split into two tiers: the first checks whether models correctly identify the visible scene and participants, while the second asks about crash mechanics, causes, timing, and results after the event. When eight current vision-language models are tested, they perform adequately on description tasks yet consistently falter on questions that require connecting events across time or determining why a crash occurred. This gap matters for cooperative autonomous driving, where infrastructure cameras must supply reliable understanding of safety-critical moments that ego-vehicle views alone cannot provide.

Core claim

CrashSight supplies 250 real-world roadside crash videos together with 13K question-answer pairs under a two-tier taxonomy. Tier 1 measures visual grounding of scene context and involved parties. Tier 2 measures higher-level reasoning that includes crash mechanics, causal attribution, temporal progression, and post-crash outcomes. Evaluation of eight state-of-the-art vision-language models on the benchmark shows strong scene-description performance alongside clear weakness in temporal and causal reasoning within safety-critical traffic scenarios.
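
To make the evaluation protocol concrete, the sketch below scores a model's multiple-choice answers per tier. It is a minimal illustration, not the authors' released code: the field names (tier, video, question, options, answer) and the query_vlm callable are assumptions about how such a benchmark could be wired up.

```python
# Minimal sketch of tier-wise multiple-choice scoring for a CrashSight-style
# benchmark. Field names and the query_vlm callable are hypothetical; the
# paper's released evaluation code may differ.
from collections import defaultdict

def evaluate_by_tier(questions, query_vlm):
    """questions: iterable of dicts with keys
    'video', 'question', 'options', 'answer', and 'tier' ('tier1' or 'tier2')."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        # query_vlm is assumed to return one of the option labels
        pred = query_vlm(q["video"], q["question"], q["options"])
        total[q["tier"]] += 1
        correct[q["tier"]] += int(pred == q["answer"])
    return {tier: correct[tier] / total[tier] for tier in total}

# The paper's headline finding is a gap between Tier 1 (visual grounding)
# and Tier 2 (mechanics, causality, timing, outcomes) accuracy, e.g.:
# scores = evaluate_by_tier(crashsight_questions, my_vlm_query_fn)
# print(scores)  # hypothetical output: {'tier1': 0.80, 'tier2': 0.55}
```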

What carries the argument

The two-tier taxonomy of questions applied to 250 infrastructure-view crash videos, with Tier 1 testing visual grounding and Tier 2 testing reasoning over mechanics, causation, time, and outcomes.

If this is right

  • Vision-language models need explicit mechanisms to track event sequences and infer causes in dynamic roadside scenes.
  • Infrastructure camera data can fill gaps left by ego-vehicle views for complete traffic crash understanding.
  • The benchmark supplies a repeatable way to measure progress toward reliable VLM use in cooperative driving systems.
  • Failure patterns identified in the benchmark point to concrete weaknesses in handling phase-aware crash progression.
  • A standardized infrastructure-centric evaluation supports targeted improvements in perception for autonomous vehicles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar two-tier benchmarks could expose reasoning shortfalls in other time-critical domains such as medical event monitoring or industrial safety.
  • Training pipelines for vision-language models would benefit from larger volumes of explicitly annotated temporal and causal examples drawn from real incidents.
  • Pairing CrashSight-style tests with controlled simulations could isolate whether weaknesses stem from data scarcity or architectural limits.
  • Widespread adoption of infrastructure-assisted benchmarks may speed integration of roadside perception into smart-city traffic systems.

Load-bearing premise

The 250 videos and their annotations under the two-tier taxonomy accurately represent real-world crash scenarios and test genuine reasoning capabilities rather than superficial pattern matching.

What would settle it

A model that scores high on both tiers of the CrashSight questions, especially the temporal and causal items, while retaining strong results on existing general benchmarks would directly contradict the reported performance gap.
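
As a hedged illustration of that falsification test, the criterion can be expressed as a simple check over per-tier accuracies; the threshold values below are placeholders, not numbers taken from the paper.

```python
# Illustrative falsification check: a model clearing both CrashSight tiers
# while staying strong on a general video benchmark would contradict the
# reported gap. Thresholds are arbitrary placeholders, not from the paper.
def contradicts_reported_gap(tier_scores, general_score,
                             tier_floor=0.85, general_floor=0.80):
    """tier_scores: {'tier1': acc, 'tier2': acc}; accuracies in [0, 1]."""
    return (tier_scores.get("tier1", 0.0) >= tier_floor
            and tier_scores.get("tier2", 0.0) >= tier_floor
            and general_score >= general_floor)
```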

Figures

Figures reproduced from arXiv: 2604.08457 by Bin Ran, Junyi Ma, Kai Chen, Pei Li, Rui Gan, Sikai Chen, Xingyou Yang.

Figure 1: Overview of CrashSight-VQA. (a) Phase-aware temporal structure of a crash video. (b) VLM performance comparison across 7 … [image not reproduced here]
Figure 2: Overview of the CrashSight benchmark curation pipeline. Surveillance videos are processed through a three-stage annotation … [image not reproduced here]
Figure 3: QA taxonomy of CrashSight. Seven categories are … [image not reproduced here]
Figure 4: Dataset statistics of CrashSight-VQA. (a) Ground-truth … [image not reproduced here]
read the original abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CrashSight, a benchmark of 250 real-world roadside crash videos from infrastructure cameras, annotated with 13K multiple-choice QA pairs under a two-tier taxonomy. Tier 1 targets visual grounding of scene context and involved parties; Tier 2 targets higher-level reasoning on crash mechanics, causal attribution, temporal progression, and post-crash outcomes. The authors evaluate eight state-of-the-art VLMs and report that models perform well on scene description but struggle on temporal and causal reasoning in safety-critical scenarios. The work positions the benchmark as a standardized framework for infrastructure-assisted perception in cooperative autonomous driving and releases the dataset and code.

Significance. If the Tier 2 questions genuinely isolate causal and temporal reasoning rather than permitting static-frame or language-prior shortcuts, the reported performance gaps would provide actionable guidance for improving VLMs in safety-critical traffic understanding. The infrastructure-centric focus fills a gap left by ego-vehicle benchmarks, and the public release of the 250-video corpus with 13K QA pairs constitutes a concrete resource for the community.

major comments (2)
  1. [Dataset Construction / Experiments] Dataset Construction and Experiments sections: The central claim that current VLMs 'struggle with temporal and causal reasoning' rests on Tier 2 questions genuinely requiring those faculties. The manuscript provides no human performance baselines, inter-annotator agreement statistics, or ablation results (e.g., single-frame vs. full-video inputs) to rule out shortcut solutions based on static cues or language priors. Without these controls, the observed score gap could reflect annotation artifacts rather than a reasoning deficit.
  2. [Abstract / Dataset Construction] Abstract and Section 3 (or equivalent): Video selection criteria, annotation methodology, and quality-control procedures for the 250 videos and 13K QA pairs are not described in sufficient detail to assess whether the corpus accurately represents real-world crash diversity or whether Tier 2 items were validated for reasoning depth.
minor comments (2)
  1. [Abstract] The abstract states the dataset size and taxonomy but does not report quantitative VLM results or failure-mode statistics; these should be summarized with at least one key table reference for readers who stop at the abstract.
  2. [Experiments] Ensure that all model names, exact prompt templates, and evaluation metrics (accuracy, etc.) are explicitly listed in the experimental setup rather than left to supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of validating the benchmark's claims regarding VLM limitations in temporal and causal reasoning. We address each major comment below and have revised the manuscript to incorporate additional controls and details.

read point-by-point responses
  1. Referee: Dataset Construction / Experiments sections: The central claim that current VLMs 'struggle with temporal and causal reasoning' rests on Tier 2 questions genuinely requiring those faculties. The manuscript provides no human performance baselines, inter-annotator agreement statistics, or ablation results (e.g., single-frame vs. full-video inputs) to rule out shortcut solutions based on static cues or language priors. Without these controls, the observed score gap could reflect annotation artifacts rather than a reasoning deficit.

    Authors: We agree that these controls are necessary to substantiate the claim that performance gaps reflect reasoning deficits rather than artifacts. In the revised manuscript, we have added: (i) human performance baselines from multiple expert annotators on a stratified subset of Tier 2 questions; (ii) inter-annotator agreement statistics (Cohen's kappa) computed during QA pair validation; and (iii) an ablation study evaluating all eight VLMs on single-frame (middle frame) versus full-video inputs. The results show that full-video inputs yield only marginal gains on Tier 1 but substantial improvements on Tier 2 for models with video capabilities, supporting that the gaps are driven by temporal/causal demands rather than static cues or language priors alone. revision: yes

  2. Referee: Abstract / Dataset Construction: Video selection criteria, annotation methodology, and quality-control procedures for the 250 videos and 13K QA pairs are not described in sufficient detail to assess whether the corpus accurately represents real-world crash diversity or whether Tier 2 items were validated for reasoning depth.

    Authors: We acknowledge that the original description was insufficiently detailed. We have expanded Section 3 with a dedicated subsection on dataset construction that now specifies: video selection criteria (diversity across crash severity, vehicle types, weather, lighting, and geographic locations drawn from multiple infrastructure sources); annotation methodology (two-stage process with initial question generation by domain experts followed by multiple-choice option creation to minimize language priors, with explicit targeting of causal/temporal elements in Tier 2); and quality-control procedures (independent review by three annotators per item, adjudication of disagreements, and post-hoc validation that Tier 2 questions cannot be solved from a single frame or generic priors via pilot testing). These additions allow readers to evaluate the corpus's representativeness and the reasoning depth of the questions. revision: yes
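
For readers checking the agreement statistic cited in the first response above, the sketch below computes Cohen's kappa for two annotators' multiple-choice labels. It is the generic textbook formula, not the authors' validation pipeline.

```python
# Cohen's kappa for two annotators labeling the same multiple-choice items.
# Generic formula shown for illustration; not the authors' validation code.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)

# Example over answer options A-D chosen by two annotators:
# cohens_kappa(["A", "B", "C", "A"], ["A", "B", "D", "A"])  # ≈ 0.64
```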

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent dataset and model evaluations

full rationale

The paper creates a new 250-video dataset with 13K QA pairs under an explicit two-tier taxonomy, then reports direct VLM performance numbers on it. No equations, fitted parameters, predictions derived from the same data, or self-citation chains appear in the provided text. Tier 1/Tier 2 distinctions and failure analyses are presented as empirical observations, not as derivations that reduce to the inputs by construction. The central claim (strong scene description but weak temporal/causal reasoning) rests on external model testing against the released benchmark, which is falsifiable and not self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests primarily on the creation of a new annotated video dataset and custom evaluation taxonomy; the main unverified premise is that the chosen videos and questions validly measure the targeted reasoning skills.

axioms (1)
  • domain assumption Roadside camera footage contains sufficient visual detail to support reliable human annotation of crash mechanics, causality, and temporal progression.
    Invoked implicitly when constructing the Tier 2 questions on crash mechanics and causal attribution.

pith-pipeline@v0.9.0 · 5534 in / 1334 out tokens · 60490 ms · 2026-05-10T17:06:19.443779+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. An Agentic Workflow for Detecting Personally Identifiable Information in Crash Narratives

    cs.CR · 2026-04 · unverdicted · novelty 6.0

    A hybrid agentic workflow using Presidio for structured PII and fine-tuned LLMs plus verification for names, addresses, and identifiers detects PII in crash narratives at 0.82 precision and 0.94 recall.

Reference graph

Works this paper leans on

37 extracted references · 15 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

    Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024.

  2. [2]

    PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025.

  3. [3]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized language models. arXiv preprint arXiv:2305.14314, 2023. Accepted to NeurIPS 2023.

  4. [4]

    Trafficvlm: A controllable visual language model for traffic video captioning

    Quang Minh Dinh, Minh Khoi Ho, Anh Quan Dang, and Hung Phong Tran. Trafficvlm: A controllable visual language model for traffic video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7134–7143, 2024.

  5. [5]

    Dada-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark

    Jianwu Fang, Dingxin Yan, Jiahuan Qiao, Jianru Xue, He Wang, and Sen Li. Dada-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 4303–4309. IEEE, 2019.

  6. [6]

    Abductive ego-view accident video understanding for safe driving perception

    Jianwu Fang, Lei-lei Li, Junfei Zhou, Junbin Xiao, Hongkai Yu, Chen Lv, Jianru Xue, and Tat-Seng Chua. Abductive ego-view accident video understanding for safe driving perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22030–22040, 2024.

  7. [7]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Zhao Peiyuan, Jia Hao, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  8. [8]

    Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

    Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3302–3310, 2025.

  9. [9]

    Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding

    Younggun Kim, Ahmed S Abdelrahman, and Mohamed Abdel-Aty. Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 761–771,

  10. [10]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proc. ACM SIGOPS Symp. Oper. Syst. Principles (SOSP), 2023. vLLM.

  11. [11]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  12. [12]

    Anomaly detection and localization in crowded scenes

    Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2013.

  13. [13]

    Improving LLM video understanding with 16 frames per second

    Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Improving LLM video understanding with 16 frames per second. arXiv preprint arXiv:2503.13956, 2025.

  14. [14]

    Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models

    Bo Liu, Pengfei Qiao, Minhan Ma, Xuange Zhang, Yinan Tang, Peng Xu, Kun Liu, and Tongtong Yuan. Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models. arXiv preprint arXiv:2505.12589, 2025.

  15. [15]

    Future frame prediction for anomaly detection–a new baseline

    Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.

  16. [16]

    Abnormal event detection at 150 fps in matlab

    Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727,

  17. [17]

    A simulation-based framework for urban traffic accident detection

    Haohan Luo and Feng Wang. A simulation-based framework for urban traffic accident detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

  18. [18]

    Conan: Progressive learning to reason like a detective over multi-scale visual evidence

    Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Conan: Progressive learning to reason like a detective over multi-scale visual evidence. arXiv preprint arXiv:2510.20470, 2025.

  19. [19]

    Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives

    Chirag Parikh, Deepti Rawat, Tathagata Ghosh, Ravi Kiran Sarvadevabhatla, et al. Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19002–19011,

  20. [20]

    Cot-vlm4tar: Chain-of-thought guided vision-language models for traffic anomaly resolution

    Tianchi Ren, Haibo Hu, Jiacheng Zuo, Xinhong Chen, Jianping Wang, Chun Jason Xue, Jen-Ming Wu, and Nan Guan. Cot-vlm4tar: Chain-of-thought guided vision-language models for traffic anomaly resolution. arXiv preprint arXiv:2503.01632, 2025.

  21. [21]

    Safeplug: Empowering multimodal llms with pixel-level insight and temporal grounding for traffic accident understanding

    Zihao Sheng, Zilin Huang, Yansong Qu, Jiancong Chen, Yuhao Luo, Yen-Jung Chen, Yue Leng, and Sikai Chen. Safeplug: Empowering multimodal llms with pixel-level insight and temporal grounding for traffic accident understanding. arXiv preprint arXiv:2508.06763, 2025.

  22. [22]

    Real-world anomaly detection in surveillance videos

    Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.

  23. [23]

    Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4567–4576, 2025.

  24. [24]

    Interact-video: Reasoning-rich video qa for urban traffic

    Joseph Raj Vishal, Divesh Basina, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, and Bharatesh Chakravarthi. Interact-video: Reasoning-rich video qa for urban traffic. arXiv preprint arXiv:2507.14743, 2025.

  25. [25]

    Qwen2.5-VL Technical Report

    Peng Wang, Shuai Bai, Sinan Gao, Jialin Wang Garcia, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  26. [26]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  27. [27]

    DriveQA: Passing the driving knowledge test

    Maolin Wei, Wanzhou Liu, and Eshed Ohn-Bar. DriveQA: Passing the driving knowledge test. arXiv preprint arXiv:2508.21824, 2025. Accepted to ICCV 2025.

  28. [28]

    Hazardvlm: A video language model for real-time hazard description in automated driving systems

    Dannier Xiao, Mehrdad Dianati, Paul Jennings, and Roger Woodman. Hazardvlm: A video language model for real-time hazard description in automated driving systems. IEEE Transactions on Intelligent Vehicles, 2024.

  29. [29]

    Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events

    Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9878–9888, 2021.

  30. [30]

    TAD: A large-scale benchmark for traffic accidents detection from video surveillance

    Yajun Xu, Chengwei Huang, Yong Nan, and Shiguo Lian. TAD: A large-scale benchmark for traffic accidents detection from video surveillance. IEEE Access, 13:2018–2033, 2025.

  31. [31]

    V2x-vlm: End-to-end v2x cooperative autonomous driving through large vision-language models

    Junwei You, Zhuoyu Jiang, Zilin Huang, Haotian Shi, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, and Bin Ran. V2x-vlm: End-to-end v2x cooperative autonomous driving through large vision-language models. Transportation Research Part C: Emerging Technologies, 183:105457, 2026.

  32. [32]

    Traffic accident benchmark for causality recognition

    Tackgeun You and Bohyung Han. Traffic accident benchmark for causality recognition. In European Conference on Computer Vision, pages 540–556. Springer, 2020.

  33. [33]

    Towards surveillance video-and-language understanding: New dataset baselines and challenges

    Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset baselines and challenges. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22052–22061, 2024.

  34. [34]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8995–9006, 2025.

  35. [35]

    TUMTraffic-VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes

    Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Vennkatnarayanan Lakshminarasimhan, Leah Strand, and Alois C. Knoll. TUMTraffic-VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes. arXiv preprint arXiv:2502.02449, 2025.

  36. [36]

    Tau-106k: A new dataset for comprehensive understanding of traffic accident

    Yixuan Zhou, Long Bai, Sijia Cai, Bing Deng, Xing Xu, and Heng Tao Shen. Tau-106k: A new dataset for comprehensive understanding of traffic accident. In The Thirteenth International Conference on Learning Representations, 2025.

  37. [37]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.