Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Henrique Piñeiro Monteagudo; Leonardo Taccari; Tomaso Trinci

arxiv: 2605.22185 · v1 · pith:4I23UPQHnew · submitted 2026-05-21 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-22 06:32 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: multimodal large language models, safety-critical events, driving video analysis, pseudo-labels, telematics data, DoRA adapters, computer vision fusion, QwenVL-2.5

The pith

Fusing telematics with video frames and CV models generates pseudo-labels that improve MLLMs for safety-critical driving event detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline to make multimodal large language models better at perceiving and reasoning about rare high-stakes events in driving videos. It combines downsampled video frames with high-frequency IMU and GPS telematics plus outputs from specialized computer vision models to create descriptive captions and question-answer pairs as pseudo-labels. These labels train an open-source model through efficient fine-tuning. A sympathetic reader would care because accurate analysis of near-collisions and similar incidents could support safer vehicle systems. The method achieves measurable gains while keeping trainable parameters below 50 million and compute demands low.

Core claim

By fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models, the pipeline generates high-quality pseudo-labels including descriptive captions and question-answer pairs to train MLLMs to identify and describe Safety-Critical Events in real-world driving footage. Fine-tuning the QwenVL-2.5 model via DoRA adapters produces significant improvements in identifying and explaining these events with fewer than 50M trainable parameters and limited computational budget.
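For readers unfamiliar with DoRA, the adapter named in the core claim splits each adapted weight matrix into a magnitude vector and a low-rank directional update, following the general formulation in the DoRA paper (Liu et al., 2024). The block below restates that formulation only. The rank $r$, the set of adapted layers, and the exact parameter count used in this paper are not given in the abstract, so none are assumed here.

```latex
W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c},
\qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
m \in \mathbb{R}^{1 \times k},\quad
r \ll \min(d, k)
```

Here $W_0$ is the frozen pretrained weight, $\lVert \cdot \rVert_c$ is the column-wise vector norm, and $m$ is initialized to $\lVert W_0 \rVert_c$. Only $m$, $B$, and $A$ are trained, which adds $r(d + k) + k$ trainable parameters per adapted matrix. That is how a budget such as "fewer than 50M" can stay small relative to the full model.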

What carries the argument

The fusion pipeline that generates pseudo-labels from downsampled frames, telematics data, and specialized CV model outputs for MLLM training on safety-critical events.
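One way to picture the fusion step is as a per-frame summary of the high-frequency telematics samples that fall around each retained video frame. The sketch below is illustrative only: `FrameContext`, `summarize_telematics`, the `half_window` parameter, and the choice of acceleration and speed as signals are all assumptions, not details taken from the paper.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FrameContext:
    """Hypothetical per-frame summary of nearby telematics samples."""
    frame_time: float      # timestamp of the downsampled frame, seconds
    peak_abs_accel: float  # largest |acceleration| in the window
    mean_speed: float      # mean speed in the window
    n_samples: int         # number of telematics samples in the window


def summarize_telematics(
    frame_times: Sequence[float],
    sample_times: Sequence[float],  # must be sorted ascending
    accel: Sequence[float],
    speed: Sequence[float],
    half_window: float = 0.5,
) -> List[FrameContext]:
    """Summarize telematics in [t - half_window, t + half_window] per frame."""
    contexts = []
    for t in frame_times:
        # Binary search for the slice of samples inside the window.
        lo = bisect_left(sample_times, t - half_window)
        hi = bisect_right(sample_times, t + half_window)
        if lo == hi:
            # No telematics coverage for this frame.
            contexts.append(FrameContext(t, 0.0, 0.0, 0))
            continue
        window_accel = accel[lo:hi]
        window_speed = speed[lo:hi]
        contexts.append(
            FrameContext(
                frame_time=t,
                peak_abs_accel=max(abs(a) for a in window_accel),
                mean_speed=sum(window_speed) / len(window_speed),
                n_samples=hi - lo,
            )
        )
    return contexts
```

A summary like this could then be rendered into text and combined with CV model outputs when generating captions and QA pairs. Keeping the peak value rather than the mean for acceleration reflects the fact that a brief harsh-braking spike can fall between two downsampled frames.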

If this is right

MLLMs achieve significant improvements in identifying and explaining safety-critical events such as collisions or near-collisions.
Training succeeds with fewer than 50 million trainable parameters.
The approach operates under a limited computational budget.
Open-source models like QwenVL-2.5 can be adapted for domain-specific safety analysis in driving footage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fusion method could scale training data creation for other rare-event video tasks where manual labels are scarce.
Integrating similar sensor fusion might strengthen real-time perception modules in autonomous driving stacks.
Applying the pipeline to additional domains like industrial monitoring could test its generality beyond driving.

Load-bearing premise

The pseudo-labels produced by fusing downsampled frames, telematics, and specialized computer vision models are sufficiently accurate and unbiased to serve as effective training targets without introducing systematic errors.

What would settle it

Testing the fine-tuned MLLM on a separate set of human-annotated driving videos and measuring whether its accuracy on safety-critical event identification and explanation drops substantially below the reported gains.
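The test described above reduces, in its simplest form, to comparing the model's per-clip decisions with human annotations. A minimal sketch follows, assuming a binary "SCE / no SCE" label per clip; the function name `precision_recall` and the binary framing are assumptions, and the paper's actual evaluation protocol is not described in the abstract.

```python
from typing import Sequence, Tuple


def precision_recall(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall of predicted SCE flags against human labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    # Return 0.0 when a denominator is empty, to avoid dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Recall is the quantity most relevant to the under-detection concern: a pipeline that systematically misses near-collisions shows up as low recall against the human labels, even when precision looks healthy.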

Figures

Figures reproduced from arXiv: 2605.22185 by Henrique Piñeiro Monteagudo, Leonardo Taccari, Tomaso Trinci.

**Figure 2.** High-level overview of our proposed solution.

**Figure 3.** Pseudo-labels generated with our proposed pipeline, corresponding to the same event depicted in …

Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and limited computational budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical recipe for fusing telematics and CV outputs to make pseudo-labels for rare safety-critical events, then fine-tunes an MLLM on them, but supplies no numbers or label validation to show the gains are real.

The main takeaway is that the authors built a pipeline to create training data for MLLMs on low-frequency, high-stakes driving events. They combine downsampled video frames with synchronized IMU and GPS readings plus semantic outputs from specialized CV models, then generate focused captions and QA pairs for safety-critical events (SCEs). They apply this to fine-tune QwenVL-2.5 with DoRA adapters using under 50M trainable parameters and limited compute. That specific data-generation step for SCEs is the concrete new element here, even though the individual pieces like fusion and adapters are established.

The paper does well by keeping the method lightweight and aimed at a genuine deployment bottleneck in autonomous systems where general MLLMs miss rare cases. The engineering focus on real-world constraints is useful.

The soft spots are clear. The abstract claims significant improvements in identifying and explaining these events but gives no metrics, baselines, dataset sizes, or error analysis, so it is impossible to judge whether the gains hold up or depend on particular choices. More critically, the pseudo-labels rest on the assumption that the CV models and fusion produce accurate, unbiased targets. Current detectors often miss or misclassify near-collisions, and the work reports no quantitative check of the generated labels against human expert annotations on a held-out set of critical events. If those labels carry systematic errors, the fine-tuned model will simply reproduce them.

The paper is empirical rather than theoretical, with standard citations for the components it reuses. This is for researchers and engineers working on multimodal models for autonomous driving or safety analysis. A reader looking for concrete ways to bootstrap training data on rare events will get value from the pipeline description. It deserves a serious referee because the problem matters for safety and the method is a clear practical contribution, provided the full experiments and validation are included.

Referee Report

1 major / 1 minor

Summary. The paper introduces a pipeline that fuses downsampled driving video frames with synchronized telematics (IMU/GPS) and outputs from off-the-shelf CV models to generate pseudo-labels (captions and QA pairs) for safety-critical events (SCEs). These labels are then used to fine-tune the open-source QwenVL-2.5 MLLM via DoRA adapters, with the central claim being significant improvements in identifying and explaining rare high-stakes events using fewer than 50M trainable parameters and limited compute.

Significance. If the pseudo-labels prove reliable, the work offers a practical, parameter-efficient route to specialize MLLMs for safety-critical perception in driving, an area where general MLLMs currently struggle with rare dynamic events. The emphasis on low-resource adaptation is a constructive contribution to the field.

major comments (1)

[Method / Pseudo-label Generation] Pseudo-label generation pipeline: The manuscript relies on the fusion of downsampled frames, telematics, and specialized CV models to produce training targets, yet provides no quantitative validation (e.g., agreement rates, precision/recall) of these pseudo-labels against independent human expert annotations on a held-out set of safety-critical events. This is load-bearing because systematic under-detection of near-collisions by the CV models would cause the fine-tuned MLLM to reproduce rather than correct those errors.

minor comments (1)

[Abstract] Abstract: The claim of 'significant improvements' is stated without any numerical metrics, baseline comparisons, dataset sizes, or error bars, which reduces the immediate informativeness of the summary.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of our pseudo-labeling approach. We address the major comment point-by-point below.

Referee: [Method / Pseudo-label Generation] Pseudo-label generation pipeline: The manuscript relies on the fusion of downsampled frames, telematics, and specialized CV models to produce training targets, yet provides no quantitative validation (e.g., agreement rates, precision/recall) of these pseudo-labels against independent human expert annotations on a held-out set of safety-critical events. This is load-bearing because systematic under-detection of near-collisions by the CV models would cause the fine-tuned MLLM to reproduce rather than correct those errors.

Authors: We agree that direct quantitative validation of the pseudo-labels is important for establishing their reliability. Our pipeline combines high-frequency telematics (which provides objective kinematic signals for near-collision detection) with CV model outputs and downsampled frames to mitigate individual model weaknesses, and we observe strong downstream gains on held-out real-world SCE detection tasks. However, we acknowledge the absence of a reported human-expert agreement study on a held-out set. In the revised version we will add a dedicated validation subsection that reports inter-annotator agreement and precision/recall of a random sample of generated captions and QA pairs against two independent driving-safety experts. This will directly address the concern about potential systematic under-detection.

Revision: yes
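The inter-annotator agreement promised in the rebuttal could be reported in several ways. The sketch below uses Cohen's kappa for two annotators, which is a common choice but an assumption here: the rebuttal does not name a statistic, and `cohens_kappa` and the example label strings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Because safety-critical events are rare, raw percent agreement can look high even when the experts disagree on most of the actual events. A chance-corrected statistic such as kappa makes that easier to see.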

Circularity Check

0 steps flagged

No significant circularity; empirical fine-tuning with independent evaluation

The manuscript describes an empirical pipeline for generating pseudo-labels via data fusion and then fine-tuning QwenVL-2.5 with DoRA adapters. No equations, predictions, or first-principles derivations are present that reduce reported performance metrics to quantities defined by the same fitted parameters or self-citations. The central claims rest on experimental improvements measured against held-out test data rather than self-referential definitions. The work is self-contained against external benchmarks because the evaluation of safety-critical event identification is performed on real-world driving footage using standard metrics, with no load-bearing step that collapses to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that existing CV models and telematics sensors provide reliable auxiliary signals, but these are treated as black-box inputs rather than newly postulated entities.

pith-pipeline@v0.9.0 · 5705 in / 1178 out tokens · 29873 ms · 2026-05-22T06:32:36.949345+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 5 internal anchors

[1] L. Shi, B. Jiang, T. Zeng, and F. Guo, “Scvlm: Enhancing vision-language model for safety-critical event understanding,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2025, pp. 1061–1071.

[2] L. Bravi, L. Kubin, S. Caprasecca, D. C. de Andrade, M. Simoncini, L. Taccari, and F. Sambo, “Detection of stop sign violations from dashcam data,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5411–5420, 2021.

[3] T. Trinci, S. Magistri, T. Bianconcini, L. Taccari, L. Sarti, and F. Sambo, “Color is not enough: Dataset and method for identifying relevant traffic lights in driving scenes,” IEEE Transactions on Intelligent Transportation Systems, 2025.

[4] S. Bai et al., “Qwen3-VL Technical Report,” arXiv preprint arXiv:2511.21631, 2025.

[5] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., “InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency,” arXiv preprint arXiv:2508.18265, 2025.

[6] C. Clark et al., “Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding,” arXiv preprint arXiv:2601.10611, 2026.

[7] T. Yu et al., “MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe,” arXiv preprint arXiv:2509.18154, 2025.

[8] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.

[9] L. Shi, B. Jiang, Z. Yuan, M. A. Perez, and F. Guo, “Synshrp2: A synthetic multimodal benchmark for driving safety-critical events derived from real-world driving data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 4586–4596.

[10] A.-M. Marcu, L. Chen, J. Hünermann, A. Karnsund, B. Hanotte, P. Chidananda, S. Nair, V. Badrinarayanan, A. Kendall, J. Shotton et al., “Lingoqa: Visual question answering for autonomous driving,” in European Conference on Computer Vision. Springer, 2024, pp. 252–269.

[11] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024, pp. 256–274.

[12] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual explanations for self-driving vehicles,” Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[13] J. M. Hankey, M. A. Perez, and J. A. McClafferty, “Description of the shrp 2 naturalistic database and the crash, near-crash, and baseline data sets,” Virginia Tech Transportation Institute, Tech. Rep., 2016.

[14] Y. Yao, X. Wang, M. Xu, Z. Pu, Y. Wang, E. Atkins, and D. Crandall, “Dota: unsupervised detection of traffic anomaly in driving videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[15] D. Moura, S. Zhu, and O. Zvitia, “Nexar dashcam collision prediction dataset and challenge,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2583–2591.

[16] L. Kubin, T. Bianconcini, D. C. de Andrade, M. Simoncini, L. Taccari, and F. Sambo, “Deep crash detection from vehicular sensor data with multimodal self-supervision,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 12480–12489, 2021.

[17] L. Shi, C. Qian, and F. Guo, “Real-time driving risk assessment using deep learning with xgboost,” Accident Analysis & Prevention, vol. 178, p. 106836, 2022.

[18] M. Simoncini, D. C. de Andrade, L. Taccari, S. Salti, L. Kubin, F. Schoen, and F. Sambo, “Unsafe maneuver classification from dashcam video and gps/imu sensors using spatio-temporal attention selector,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 15605–15615, 2022.

[19] N. Polley, S. Pavlitska, Y. Boualili, P. Rohrbeck, P. Stiller, A. K. Bangaru, and J. M. Zöllner, “Tld-ready: Traffic light detection-relevance estimation and deployment analysis,” in 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2024, pp. 3800–3806.

[20] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen, “Dora: Weight-decomposed low-rank adaptation,” in International Conference on Machine Learning, 2024.

[21] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-VL Technical Report,” arXiv preprint arXiv:2502.13923, 2025.

[22] N. Jain et al., “NEFTune: Noisy embeddings improve instruction finetuning,” in The Twelfth International Conference on Learning Representations, 2024.

[23] M. Assran et al., “V-jepa 2: Self-supervised video models enable understanding, prediction and planning,” 2025.
