hub Canonical reference

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao · 2024 · cs.CV · arXiv 2402.12289

Canonical reference. 90% of citing Pith papers cite this work as background.

70 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 70 citing papers arXiv PDF

abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 19 baseline 1 dataset 1

citation-polarity summary

background 19 baseline 1 use dataset 1

representative citing papers

Membership Inference Attacks on Vision-Language-Action Models

cs.CR · 2026-05-08 · unverdicted · novelty 8.0

Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

cs.CV · 2026-04-14 · unverdicted · novelty 8.0

MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

FleetAgent: Teleoperation Assistant for Autonomous Fleets via Vectorized V2N Messages

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

FleetAgent pairs a vector-to-embedding interface (VecFormer) with an MLLM to turn compact V2N messages into structured natural-language teleoperation assistance, cutting uplink payload 625x and improving Lingo-Judge score 16.8% on a new nuScenes-derived dataset.

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

VLADriveBench combines observational metrics and CoT intervention protocols to evaluate the relevance and causality of reasoning in vision-language-action models for autonomous driving, revealing divergent model behaviors.

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

cs.CR · 2026-06-10 · unverdicted · novelty 7.0

Introduces Neighbor Leakage Rate showing high trigger leakage in VLAS backdoors at 3% poisoning, caused by broad activation regions in fine-tuning that hard-negative samples can narrow.

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

M³Eval is a new cognitively-grounded benchmark that evaluates memory dimensions in multi-modal video models and reports consistent model weaknesses in disentanglement, interference, spatial-temporal grounding, and symbolic recall.

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

Grounding Driving VLA via Inverse Kinematics

cs.CV · 2026-05-20 · conditional · novelty 7.0

By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Hyperbolic Concept Bottleneck Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.

Learning Vision-Language-Action World Models for Autonomous Driving

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

cs.RO · 2026-03-18 · conditional · novelty 7.0

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

cs.CV · 2025-06-09 · unverdicted · novelty 7.0

ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.

Teaching Vision-Language-Action Models What to See and Where to Look

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

DriveTeach-VLA adds Driving-aware Vision Distillation pretraining and 2D Trajectory-Guided Prompts to VLA models, then reports state-of-the-art results on NAVSIM and nuScenes.

X-Mind: Efficient Visual Chain-of-Thought via Predictive World Model for End-to-End Driving

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

X-Mind proposes an efficient internal visual chain-of-thought using compressed BEV sketches and recurrent block diffusion to embed predictive world models into end-to-end driving policies.

OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

OmniSpace is a plug-and-play method that improves spatial reasoning in MLLMs for AV by injecting camera pose, using epipolar attention across views, and distilling 3D geometric knowledge to overcome weak cross-view correspondence and depth estimation.

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

VLGA introduces geometry as a fourth modality in VLA models via pointmap regression loss, reporting SOTA open-loop and closed-loop driving metrics on nuScenes and Bench2Drive.

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

DrivingAgent automates autonomous driving module design via LLM code generation and super-network validation, and employs an RL-trained LLM with structured memory for dynamic real-time scheduling.

citing papers explorer

Showing 15 of 15 citing papers after filters.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis cs.CV · 2026-05-21 · unverdicted · none · ref 69 · internal anchor
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
Learning Vision-Language-Action World Models for Autonomous Driving cs.CV · 2026-04-10 · unverdicted · none · ref 53 · internal anchor
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment cs.CV · 2026-04-09 · unverdicted · none · ref 46 · internal anchor
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images cs.CV · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving cs.CV · 2026-05-11 · unverdicted · none · ref 8 · 2 links · internal anchor
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving cs.CV · 2026-05-09 · unverdicted · none · ref 16 · 2 links · internal anchor
VECTOR-DRIVE uses shared self-attention with semantic-aware expert routing of tokens to VL and trajectory experts plus flow-matching action decoding to reach 88.91 driving score on Bench2Drive.
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving cs.CV · 2026-05-01 · unverdicted · none · ref 27 · 2 links · internal anchor
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving cs.CV · 2026-04-22 · unverdicted · none · ref 35 · internal anchor
EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving cs.CV · 2026-04-09 · unverdicted · none · ref 47 · internal anchor
LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning cs.CV · 2025-06-16 · unverdicted · none · ref 46 · internal anchor
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to adaptively reduce unnecessary reasoning.
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 70 · internal anchor
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model cs.CV · 2026-04-21 · unverdicted · none · ref 66 · internal anchor
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 92 · internal anchor
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 132 · internal anchor
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 35 · internal anchor
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer