pith. sign in

arxiv: 2605.22185 · v1 · pith:4I23UPQHnew · submitted 2026-05-21 · 💻 cs.CV · cs.LG

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Pith reviewed 2026-05-22 06:32 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords multimodal large language modelssafety-critical eventsdriving video analysispseudo-labelstelematics dataDoRA adapterscomputer vision fusionQwenVL-2.5
0
0 comments X

The pith

Fusing telematics with video frames and CV models generates pseudo-labels that improve MLLMs for safety-critical driving event detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline to make multimodal large language models better at perceiving and reasoning about rare high-stakes events in driving videos. It combines downsampled video frames with high-frequency IMU and GPS telematics plus outputs from specialized computer vision models to create descriptive captions and question-answer pairs as pseudo-labels. These labels train an open-source model through efficient fine-tuning. A sympathetic reader would care because accurate analysis of near-collisions and similar incidents could support safer vehicle systems. The method achieves measurable gains while keeping trainable parameters below 50 million and compute demands low.

Core claim

By fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models, the pipeline generates high-quality pseudo-labels including descriptive captions and question-answer pairs to train MLLMs to identify and describe Safety-Critical Events in real-world driving footage. Fine-tuning the QwenVL-2.5 model via DoRA adapters produces significant improvements in identifying and explaining these events with fewer than 50M trainable parameters and limited computational budget.

What carries the argument

The fusion pipeline that generates pseudo-labels from downsampled frames, telematics data, and specialized CV model outputs for MLLM training on safety-critical events.

If this is right

  • MLLMs achieve significant improvements in identifying and explaining safety-critical events such as collisions or near-collisions.
  • Training succeeds with fewer than 50 million trainable parameters.
  • The approach operates under a limited computational budget.
  • Open-source models like QwenVL-2.5 can be adapted for domain-specific safety analysis in driving footage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fusion method could scale training data creation for other rare-event video tasks where manual labels are scarce.
  • Integrating similar sensor fusion might strengthen real-time perception modules in autonomous driving stacks.
  • Applying the pipeline to additional domains like industrial monitoring could test its generality beyond driving.

Load-bearing premise

The pseudo-labels produced by fusing downsampled frames, telematics, and specialized computer vision models are sufficiently accurate and unbiased to serve as effective training targets without introducing systematic errors.

What would settle it

Testing the fine-tuned MLLM on a separate set of human-annotated driving videos and measuring whether its accuracy on safety-critical event identification and explanation drops substantially below the reported gains.

Figures

Figures reproduced from arXiv: 2605.22185 by Henrique Pi\~neiro Monteagudo, Leonardo Taccari, Tomaso Trinci.

Figure 1
Figure 1. Figure 1: Current MLLMs struggle to identify Safety-Critical Events. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: High-level overview of our proposed solution. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pseudo-labels generated with our proposed pipeline, corresponding to the same event depicted in [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
read the original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and limited computational budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a pipeline that fuses downsampled driving video frames with synchronized telematics (IMU/GPS) and outputs from off-the-shelf CV models to generate pseudo-labels (captions and QA pairs) for safety-critical events (SCEs). These labels are then used to fine-tune the open-source QwenVL-2.5 MLLM via DoRA adapters, with the central claim being significant improvements in identifying and explaining rare high-stakes events using fewer than 50M trainable parameters and limited compute.

Significance. If the pseudo-labels prove reliable, the work offers a practical, parameter-efficient route to specialize MLLMs for safety-critical perception in driving, an area where general MLLMs currently struggle with rare dynamic events. The emphasis on low-resource adaptation is a constructive contribution to the field.

major comments (1)
  1. [Method / Pseudo-label Generation] Pseudo-label generation pipeline: The manuscript relies on the fusion of downsampled frames, telematics, and specialized CV models to produce training targets, yet provides no quantitative validation (e.g., agreement rates, precision/recall) of these pseudo-labels against independent human expert annotations on a held-out set of safety-critical events. This is load-bearing because systematic under-detection of near-collisions by the CV models would cause the fine-tuned MLLM to reproduce rather than correct those errors.
minor comments (1)
  1. [Abstract] Abstract: The claim of 'significant improvements' is stated without any numerical metrics, baseline comparisons, dataset sizes, or error bars, which reduces the immediate informativeness of the summary.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of our pseudo-labeling approach. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [Method / Pseudo-label Generation] Pseudo-label generation pipeline: The manuscript relies on the fusion of downsampled frames, telematics, and specialized CV models to produce training targets, yet provides no quantitative validation (e.g., agreement rates, precision/recall) of these pseudo-labels against independent human expert annotations on a held-out set of safety-critical events. This is load-bearing because systematic under-detection of near-collisions by the CV models would cause the fine-tuned MLLM to reproduce rather than correct those errors.

    Authors: We agree that direct quantitative validation of the pseudo-labels is important for establishing their reliability. Our pipeline combines high-frequency telematics (which provides objective kinematic signals for near-collision detection) with CV model outputs and downsampled frames to mitigate individual model weaknesses, and we observe strong downstream gains on held-out real-world SCE detection tasks. However, we acknowledge the absence of a reported human-expert agreement study on a held-out set. In the revised version we will add a dedicated validation subsection that reports inter-annotator agreement and precision/recall of a random sample of generated captions and QA pairs against two independent driving-safety experts. This will directly address the concern about potential systematic under-detection. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical fine-tuning with independent evaluation

full rationale

The manuscript describes an empirical pipeline for generating pseudo-labels via data fusion and then fine-tuning QwenVL-2.5 with DoRA adapters. No equations, predictions, or first-principles derivations are present that reduce reported performance metrics to quantities defined by the same fitted parameters or self-citations. The central claims rest on experimental improvements measured against held-out test data rather than self-referential definitions. The work is self-contained against external benchmarks because the evaluation of safety-critical event identification is performed on real-world driving footage using standard metrics, with no load-bearing step that collapses to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that existing CV models and telematics sensors provide reliable auxiliary signals, but these are treated as black-box inputs rather than newly postulated entities.

pith-pipeline@v0.9.0 · 5705 in / 1178 out tokens · 29873 ms · 2026-05-22T06:32:36.949345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 5 internal anchors

  1. [1]

    Scvlm: Enhancing vision- language model for safety-critical event understanding,

    L. Shi, B. Jiang, T. Zeng, and F. Guo, “Scvlm: Enhancing vision- language model for safety-critical event understanding,” inProceed- ings of the Winter Conference on Applications of Computer Vision, 2025, pp. 1061–1071

  2. [2]

    Detection of stop sign violations from dashcam data,

    L. Bravi, L. Kubin, S. Caprasecca, D. C. de Andrade, M. Simoncini, L. Taccari, and F. Sambo, “Detection of stop sign violations from dashcam data,”IEEE transactions on intelligent transportation sys- tems, vol. 23, no. 6, pp. 5411–5420, 2021

  3. [3]

    Color is not enough: Dataset and method for identify- ing relevant traffic lights in driving scenes,

    T. Trinci, S. Magistri, T. Bianconcini, L. Taccari, L. Sarti, and F. Sambo, “Color is not enough: Dataset and method for identify- ing relevant traffic lights in driving scenes,”IEEE Transactions on Intelligent Transportation Systems, 2025

  4. [4]

    Qwen3-VL Technical Report

    S. Baiet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shaoet al., “Internvl3. 5: Advancing open-source multi- modal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025

  6. [6]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    C. Clarket al., “Molmo2: Open weights and data for vision-language models with video understanding and grounding,”arXiv preprint arXiv:2601.10611, 2026

  7. [7]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    T. Yuet al., “Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe,”arXiv preprint arXiv:2509.18154, 2025

  8. [8]

    Roformer: En- hanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: En- hanced transformer with rotary position embedding,”Neurocomputing, vol. 568, p. 127063, 2024

  9. [9]

    Synshrp2: A synthetic multimodal benchmark for driving safety-critical events derived from real-world driving data,

    L. Shi, B. Jiang, Z. Yuan, M. A. Perez, and F. Guo, “Synshrp2: A synthetic multimodal benchmark for driving safety-critical events derived from real-world driving data,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 4586–4596

  10. [10]

    Lingoqa: Visual question answering for autonomous driving,

    A.-M. Marcu, L. Chen, J. H ¨unermann, A. Karnsund, B. Hanotte, P. Chidananda, S. Nair, V . Badrinarayanan, A. Kendall, J. Shotton et al., “Lingoqa: Visual question answering for autonomous driving,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 252–269

  11. [11]

    Drivelm: Driving with graph visual question answering,

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” inEuropean conference on computer vision. Springer, 2024, pp. 256–274

  12. [12]

    Textual explanations for self-driving vehicles,

    J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual explanations for self-driving vehicles,”Proceedings of the European Conference on Computer Vision (ECCV), 2018

  13. [13]

    Description of the shrp 2 naturalistic database and the crash, near-crash, and baseline data sets,

    J. M. Hankey, M. A. Perez, and J. A. McClafferty, “Description of the shrp 2 naturalistic database and the crash, near-crash, and baseline data sets,” Virginia Tech Transportation Institute, Tech. Rep., 2016

  14. [14]

    Dota: unsupervised detection of traffic anomaly in driving videos,

    Y . Yao, X. Wang, M. Xu, Z. Pu, Y . Wang, E. Atkins, and D. Crandall, “Dota: unsupervised detection of traffic anomaly in driving videos,” IEEE transactions on pattern analysis and machine intelligence, 2022

  15. [15]

    Nexar dashcam collision prediction dataset and challenge,

    D. Moura, S. Zhu, and O. Zvitia, “Nexar dashcam collision prediction dataset and challenge,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2583–2591

  16. [16]

    Deep crash detection from vehicular sensor data with multimodal self-supervision,

    L. Kubin, T. Bianconcini, D. C. de Andrade, M. Simoncini, L. Taccari, and F. Sambo, “Deep crash detection from vehicular sensor data with multimodal self-supervision,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 12 480–12 489, 2021

  17. [17]

    Real-time driving risk assessment using deep learning with xgboost,

    L. Shi, C. Qian, and F. Guo, “Real-time driving risk assessment using deep learning with xgboost,”Accident Analysis & Prevention, vol. 178, p. 106836, 2022

  18. [18]

    Unsafe maneuver classification from dashcam video and gps/imu sensors using spatio-temporal attention selector,

    M. Simoncini, D. C. de Andrade, L. Taccari, S. Salti, L. Kubin, F. Schoen, and F. Sambo, “Unsafe maneuver classification from dashcam video and gps/imu sensors using spatio-temporal attention selector,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 15 605–15 615, 2022

  19. [19]

    Tld-ready: Traffic light detection-relevance estimation and deployment analysis,

    N. Polley, S. Pavlitska, Y . Boualili, P. Rohrbeck, P. Stiller, A. K. Ban- garu, and J. M. Zollnerl, “Tld-ready: Traffic light detection-relevance estimation and deployment analysis,” in2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2024, pp. 3800–3806

  20. [20]

    Dora: Weight-decomposed low-rank adaptation,

    S.-Y . Liu, C.-Y . Wang, H. Yin, P. Molchanov, Y .-C. F. Wang, K.- T. Cheng, and M.-H. Chen, “Dora: Weight-decomposed low-rank adaptation,” inInternational Conference on Machine Learning, 2024

  21. [21]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  22. [22]

    NEFTune: Noisy embeddings improve instruction finetuning,

    N. Jainet al., “NEFTune: Noisy embeddings improve instruction finetuning,” inThe Twelfth International Conference on Learning Representations, 2024

  23. [23]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning,

    M. Assranet al., “V-jepa 2: Self-supervised video models enable understanding, prediction and planning,” 2025