Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.
hub Canonical reference
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
M³Eval is a new cognitively-grounded benchmark that evaluates memory dimensions in multi-modal video models and reports consistent model weaknesses in disentanglement, interference, spatial-temporal grounding, and symbolic recall.
GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.
X-Mind proposes an efficient internal visual chain-of-thought using compressed BEV sketches and recurrent block diffusion to embed predictive world models into end-to-end driving policies.
SceneMiner shows that identity-preserving multi-task fine-tuning removes cross-task interference by zero-initializing new heads and freezing shared-stream parameters, enabling unified BEV scene mining with preserved original heads.
nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.
Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.
VLA driving models show 42.5% reasoning fidelity and 48.3% reasoning-action consistency, with 97.7% trajectory fragility under perturbations.
MAPLE proposes latent multi-agent rollouts with supervised fine-tuning followed by reinforcement learning using safety, progress, interaction, and diversity rewards to enable scalable closed-loop training for end-to-end autonomous driving.
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
citing papers explorer
-
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.
-
Enhancing End-to-End Autonomous Driving with Latent World Model
LAW introduces a self-supervised prediction task on latent scene features that boosts end-to-end driving performance on nuScenes, NAVSIM, and CARLA benchmarks.