pith. machine review for the scientific record.

arxiv: 2604.19710 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Jiaqi Ma, Kateryna Pistunova, Lili Su, Ruining Yang, Sherry X. Chen, Tao Feng, Xuewei (Tony) Qi, Yiluan Guo, Yishan Shen, Zewei Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords Vision-Language-Action · Autonomous Driving · Flow Matching · GRPO Post-training · Negative-Recovery Samples · Trajectory Planning · Reasoning Dataset

The pith

SpanVLA bridges vision-language reasoning to a flow-matching policy conditioned on historical trajectories, generating driving actions faster while learning recovery behaviors from negative examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpanVLA as a framework that pairs autoregressive vision-language reasoning with a flow-matching action expert for end-to-end autonomous driving. It creates an efficient bridge so that VLM guidance and past trajectory data condition the flow-matching policy, cutting the time needed to plan future paths. The work also adds GRPO-based post-training on a new dataset called mReasoning that includes negative-recovery samples, letting the model learn both good behaviors and how to avoid or fix typical mistakes. Experiments on NAVSIM benchmarks show competitive planning results and better handling of complex scenarios. The central goal is to overcome the latency and fragility that limit current vision-language-action models in long-tail driving situations.

Core claim

SpanVLA integrates an autoregressive VLM for reasoning with a flow-matching policy for action generation. The efficient bridge conditions the flow-matching policy on a historical trajectory initialization and on the VLM's visual and reasoning guidance, which reduces inference time. GRPO post-training on the mReasoning dataset enables learning from both positive samples and negative-recovery behaviors, improving robustness in reasoning-demanding scenarios. This yields competitive performance on NAVSIM v1 and v2.
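The load-bearing mechanism in this claim is that the trajectory is produced in a fixed, small number of flow-matching integration steps rather than token by token. A minimal sketch of what such a generation loop could look like follows; the module layout, feature sizes, step count, and the plain Euler integrator are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a flow-matching action expert that integrates a
# learned velocity field for a few fixed steps, starting from a history-based
# trajectory initialization instead of pure noise. Module layout, sizes, and
# the Euler integrator are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class FlowMatchingActionExpert(nn.Module):
    def __init__(self, horizon=8, traj_dim=2, cond_dim=512, hidden=256):
        super().__init__()
        self.horizon, self.traj_dim = horizon, traj_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * traj_dim + cond_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * traj_dim),
        )

    def velocity(self, x_t, t, cond):
        # Predicted velocity field v(x_t, t | cond); cond stands in for the
        # bridged VLM guidance (see the bridge sketch below).
        inp = torch.cat([x_t.flatten(1), cond, t], dim=-1)
        return self.net(inp).view_as(x_t)

    @torch.no_grad()
    def generate(self, cond, history_init, steps=5):
        # A fixed number of Euler steps, so planning latency does not grow
        # with trajectory length the way autoregressive decoding does.
        x = history_init.clone()
        for i in range(steps):
            t = torch.full((x.size(0), 1), i / steps, device=x.device)
            x = x + self.velocity(x, t, cond) / steps
        return x  # (B, horizon, 2) future waypoints

expert = FlowMatchingActionExpert()
cond = torch.randn(4, 512)                  # stand-in for bridged VLM features
history_init = torch.zeros(4, 8, 2)         # e.g. a constant-velocity rollout
traj = expert.generate(cond, history_init)  # non-autoregressive plan
```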

What carries the argument

The efficient bridge that conditions a flow-matching policy on historical trajectory initialization and VLM guidance to plan future trajectories.
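One plausible reading of this bridge, following Figure 2's description of a sparse KV cache, is that a small set of action queries cross-attends to hidden states the VLM has already cached while producing its reasoning, so conditioning the action expert adds no further autoregressive decoding. A hedged sketch of that reading is below; the single attention layer, the keep-every-k sparsification, and all sizes are assumptions rather than the paper's design.

```python
# Illustrative sketch only: learned action queries cross-attend to (sparsified)
# hidden states cached while the VLM decoded its reasoning, so the action
# expert is conditioned without extra autoregressive steps. The single
# attention layer and keep-every-k sparsification are assumptions.
import torch
import torch.nn as nn

class ActionBridge(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_queries=4, keep_every=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.keep_every = keep_every

    def forward(self, cached_states):
        # cached_states: (B, T, d) reasoning-token states kept from decoding.
        sparse = cached_states[:, ::self.keep_every]       # crude sparsification
        q = self.queries.unsqueeze(0).expand(sparse.size(0), -1, -1)
        bridged, _ = self.attn(q, sparse, sparse)          # (B, n_queries, d)
        return bridged.mean(dim=1)                         # conditioning vector

bridge = ActionBridge()
cached = torch.randn(4, 96, 512)   # stand-in for the VLM's cached states
cond = bridge(cached)              # would feed the flow-matching expert above
```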

Load-bearing premise

Conditioning the flow-matching policy on historical trajectory initialization plus VLM guidance will produce safe trajectories across real-world driving distributions without extra constraints.
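Under this premise, training amounts to a conditional flow-matching regression from a history-derived source trajectory to the expert trajectory. A minimal sketch follows, assuming the standard linear interpolation path and a perturbed history extrapolation as the source sample; both are assumptions, not the paper's stated recipe.

```python
# Illustrative sketch only: a conditional flow-matching loss whose source
# sample x0 is a perturbed history-extrapolated trajectory rather than pure
# noise. The linear path x_t = (1 - t) x0 + t x1 with target velocity
# (x1 - x0) is standard flow matching; tying x0 to the driving history is
# the reading of "historical trajectory initialization" assumed here.
import torch
import torch.nn as nn

def flow_matching_loss(velocity_fn, cond, history_init, gt_traj):
    b = gt_traj.size(0)
    x0 = history_init + 0.1 * torch.randn_like(history_init)  # noisy source
    x1 = gt_traj                                               # expert plan
    t = torch.rand(b, 1, 1, device=gt_traj.device)
    x_t = (1 - t) * x0 + t * x1                                # linear path
    target_v = x1 - x0
    pred_v = velocity_fn(x_t, t.view(b, 1), cond)
    return torch.mean((pred_v - target_v) ** 2)

# Stand-in velocity network (the sketch expert above would be used instead).
net = nn.Linear(8 * 2 + 512 + 1, 8 * 2)
velocity_fn = lambda x, t, c: net(torch.cat([x.flatten(1), c, t], -1)).view_as(x)

loss = flow_matching_loss(velocity_fn,
                          cond=torch.randn(4, 512),
                          history_init=torch.zeros(4, 8, 2),
                          gt_traj=torch.randn(4, 8, 2))
loss.backward()
```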

What would settle it

A recorded driving sequence in which the generated trajectory collides or fails to recover in a scenario covered by the negative-recovery training data.

Figures

Figures reproduced from arXiv: 2604.19710 by Jiaqi Ma, Kateryna Pistunova, Lili Su, Ruining Yang, Sherry X. Chen, Tao Feng, Xuewei (Tony) Qi, Yiluan Guo, Yishan Shen, Zewei Zhou.

Figure 1
Figure 1: SpanVLA is a novel end-to-end autonomous driving framework, integrating the autoregressive reasoning and flow-matching action expert. It leverages a vision-language model (VLM) with chain-of-thought reasoning as the backbone, and introduces an efficient bridge to extract the multi-granular features from the VLM. Moreover, a flow-matching action expert is introduced to efficiently generate a continuous tr… view at source ↗
Figure 2
Figure 2: Overview of the efficient action bridging of the SpanVLA model. The VLM backbone leverages the autoregressive decoding to generate the reasoning results, and we introduce an action bridging to utilize the sparse KV cache to efficiently generate the continuous trajectory with historical initialization based on flow-matching, avoiding the linearly increasing latency of the autoregressive decoding with the lo… view at source ↗
Figure 3
Figure 3: mReasoning data distribution and typical negative-recovery samples. During training, to enable the model to learn how to reason for planning, we introduce an additional discrete action generation task following reasoning in the VLM, which unifies reasoning and planning within the SFT, as follows: $[\mathcal{T}_{\text{Reason}}, (A_{\text{token}})] = \mathrm{VLM}(\mathcal{V}^t, \mathcal{T}^t)$ … view at source ↗ (a toy tokenization sketch follows the figure list)
Figure 4
Figure 4: RFT results of SpanVLA in the nuPlan dataset. (a) Comparison of PDMS among different settings of RFT training samples; (b) Qualitative comparison of planning and reasoning performance in positive samples before and after RFT. view at source ↗
Figure 5
Figure 5: RFT Data-recipe comparison. Blue: fixed 2K positive warm-up (w), varying positive (p)/negative (n) ratios in the remaining samples. Red: fixed warm-up and positives, replacing negatives with recovery (r, partial to full). Green: no warm-up (replaced by positives), adding either negatives or recovery. … view at source ↗
Figure 6
Figure 6: Comparison of planning and reasoning performance of SpanVLA in negative samples of mReasoning before and after RFT with negative-recovery samples. (Panels show Left/Front/Right/Back camera views with ground-truth and planned trajectories, alongside an example slow-thinking reasoning trace.) … view at source ↗
Figure 7
Figure 7: Comparison of planning and reasoning performance of SpanVLA in recovery samples of mReasoning before and after RFT with negative-recovery samples. … view at source ↗
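Figure 3's caption folds planning into SFT by appending discrete action tokens to the reasoning output, $[\mathcal{T}_{\text{Reason}}, (A_{\text{token}})] = \mathrm{VLM}(\mathcal{V}^t, \mathcal{T}^t)$. A toy sketch of how continuous waypoints might be quantized into such tokens is below; the bin count, coordinate range, and vocabulary offset are invented for illustration and are not the paper's tokenizer.

```python
# Toy sketch only: quantize continuous (x, y) waypoints into a discrete
# action-token sequence that can be appended to the VLM's reasoning tokens
# during SFT. Bin count, coordinate range, and vocabulary offset are
# illustrative assumptions, not the paper's tokenizer.
import numpy as np

N_BINS = 128
COORD_RANGE = (-32.0, 32.0)   # metres, assumed planning extent
VOCAB_OFFSET = 32_000         # assumed start of action-token ids

def waypoints_to_tokens(waypoints):
    """waypoints: (horizon, 2) array of future (x, y) positions."""
    lo, hi = COORD_RANGE
    bins = np.clip((waypoints - lo) / (hi - lo), 0.0, 1.0)
    bins = np.round(bins * (N_BINS - 1)).astype(int)
    return (VOCAB_OFFSET + bins.reshape(-1)).tolist()

def tokens_to_waypoints(tokens, horizon=8):
    lo, hi = COORD_RANGE
    bins = np.asarray(tokens).reshape(horizon, 2) - VOCAB_OFFSET
    return lo + bins / (N_BINS - 1) * (hi - lo)

wps = np.array([[1.0, 0.2], [2.1, 0.5], [3.3, 0.9], [4.6, 1.4],
                [6.0, 2.0], [7.5, 2.7], [9.1, 3.5], [10.8, 4.4]])
tokens = waypoints_to_tokens(wps)    # appended after the reasoning tokens
recon = tokens_to_waypoints(tokens)  # quantization error on the order of 0.25 m
```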
read the original abstract

Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with the high latency in action generation using an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework, integrating an autoregressive reasoning and a flow-matching action expert. First, SpanVLA introduces an efficient bridge to leverage the vision and reasoning guidance of VLM to efficiently plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method to enable the VLA model not only to learn from positive driving samples but also to learn how to avoid the typical negative behaviors and learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset, focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on the NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, the qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
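The abstract's GRPO-based post-training learns from positive samples while penalizing typical negative behaviors and rewarding recoveries. A minimal sketch of how a group-relative advantage with such terms could be assembled is below; the reward decomposition and weights are invented for illustration, and the paper's actual reward design is not reproduced here.

```python
# Illustrative sketch only: a GRPO-style group-relative advantage where each
# sampled trajectory in a group is scored by a planning reward, plus a
# penalty when it reproduces an annotated negative behavior and a bonus when
# it recovers from one. Weights and reward terms are assumptions.
import torch

def group_relative_advantage(plan_reward, is_negative, is_recovery,
                             neg_penalty=1.0, rec_bonus=0.5, eps=1e-6):
    # plan_reward, is_negative, is_recovery: (G,) tensors for one group of
    # G rollouts sampled from the same scene prompt.
    reward = plan_reward - neg_penalty * is_negative + rec_bonus * is_recovery
    return (reward - reward.mean()) / (reward.std() + eps)

# One group of 6 rollouts: two repeat the logged negative behavior,
# one successfully recovers from it.
plan_reward = torch.tensor([0.82, 0.75, 0.40, 0.35, 0.78, 0.66])
is_negative = torch.tensor([0., 0., 1., 1., 0., 0.])
is_recovery = torch.tensor([0., 0., 0., 0., 1., 0.])
adv = group_relative_advantage(plan_reward, is_negative, is_recovery)
# Positive advantages up-weight the corresponding action tokens in the
# policy-gradient update; negative ones push the policy away from them.
```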

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpanVLA, an end-to-end Vision-Language-Action framework for autonomous driving that combines autoregressive VLM reasoning with a flow-matching action expert. It introduces an efficient bridge to condition a flow-matching policy on VLM guidance and historical trajectory initialization for reduced inference latency, a GRPO-based post-training procedure to learn from negative-recovery samples in addition to positive ones, and the mReasoning dataset focused on complex reasoning and recovery scenarios. Experiments are reported to show competitive performance on NAVSIM v1 and v2 with qualitative robustness gains.

Significance. If the quantitative claims hold with proper validation, the work could advance efficient VLA models for driving by demonstrating a practical bridge between VLMs and flow-matching policies plus negative-sample post-training, potentially aiding long-tail scenario handling. The mReasoning dataset may provide a useful resource for reasoning-focused driving research.

major comments (2)
  1. Abstract: The claim of 'competitive performance' on NAVSIM v1/v2 and 'improved robustness' is asserted without any quantitative metrics, baselines, error bars, ablation results, or statistical details, preventing assessment of whether the efficient bridge or GRPO components deliver measurable gains over prior VLA methods.
  2. The central construction (efficient bridge + flow-matching policy conditioned on historical trajectories and VLM output): No analysis is provided of out-of-distribution failure modes, mode collapse risks, or uncertainty quantification for the learned conditional distribution, which is load-bearing for the robustness and safety claims across real-world driving distributions.
minor comments (2)
  1. Abstract: The description of the 'efficient bridge' and 'GRPO-based post-training' would benefit from a high-level diagram or pseudocode to clarify the integration of autoregressive reasoning with the flow-matching expert.
  2. The introduction of mReasoning is noted as a contribution, but the abstract does not specify its size, collection protocol, or how negative-recovery samples are annotated, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve the presentation of results and analysis.

read point-by-point responses
  1. Referee: Abstract: The claim of 'competitive performance' on NAVSIM v1/v2 and 'improved robustness' is asserted without any quantitative metrics, baselines, error bars, ablation results, or statistical details, preventing assessment of whether the efficient bridge or GRPO components deliver measurable gains over prior VLA methods.

    Authors: We agree that the abstract is high-level and does not include specific metrics. The full manuscript reports quantitative results on NAVSIM v1 and v2 with baseline comparisons, ablations for the bridge and GRPO components, and qualitative robustness examples in Section 4. We will revise the abstract to include key performance numbers and references to the supporting experiments and ablations. revision: yes

  2. Referee: The central construction (efficient bridge + flow-matching policy conditioned on historical trajectories and VLM output): No analysis is provided of out-of-distribution failure modes, mode collapse risks, or uncertainty quantification for the learned conditional distribution, which is load-bearing for the robustness and safety claims across real-world driving distributions.

    Authors: The manuscript supports robustness claims primarily through the GRPO post-training on negative-recovery samples and the mReasoning dataset, with qualitative results across diverse scenarios. No dedicated quantitative analysis of OOD failure modes, mode collapse, or uncertainty quantification is currently included. We will add a new discussion subsection addressing these aspects, including limitations and future work, to better substantiate the safety-related claims. revision: yes

Circularity Check

0 steps flagged

No circularity: SpanVLA framework and training are additive proposals validated on external benchmarks.

full rationale

The paper introduces SpanVLA as a composite architecture (autoregressive VLM reasoning bridged to a flow-matching policy conditioned on historical trajectories, plus GRPO post-training on negative-recovery samples and a new mReasoning dataset). No equations, derivations, or self-referential definitions appear in the provided abstract or description that reduce claimed performance gains to quantities fitted from the same data by construction. Results are reported as empirical outcomes on NAVSIM v1/v2 rather than as logical consequences of prior fitted parameters or self-citations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities beyond the high-level model components can be extracted or verified.

pith-pipeline@v0.9.0 · 5567 in / 1049 out tokens · 36982 ms · 2026-05-10T02:31:11.858900+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...

  2. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

Reference graph

Works this paper leans on

86 extracted references · 58 canonical work pages · cited by 2 Pith papers · 14 internal anchors

  1. [1]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Arai, H., Miwa, K., Sasaki, K., Watanabe, K., Yamaguchi, Y., Aoki, S., Yamamoto, I.: Covla: Comprehensive vision-language-action dataset for autonomous driving. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1933–1943. IEEE (2025)

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  3. [3]

    Driving with regulation: Interpretable decision-making for autonomous vehicles with retrieval-augmented reasoning via llm. arXiv preprint arXiv:2410.04759, 2024

    Cai, T., Liu, Y., Zhou, Z., Ma, H., Zhao, S.Z., Wu, Z., Ma, J.: Driving with regulation: Interpretable decision-making for autonomous vehicles with retrieval-augmented reasoning via llm. arXiv preprint arXiv:2410.04759 (2024)

  4. [4]

    Cao, W., Hallgarten, M., Li, T., Dauner, D., Gu, X., Wang, C., Miron, Y., Aiello, M., Li, H., Gilitschenski, I., et al.: Pseudo-simulation for autonomous driving. arXiv preprint arXiv:2506.04218 (2025)

  5. [5]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

    Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., Li, H.: End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  6. [6]

    IEEE transactions on pattern analysis and machine intelligence 45(11), 12878–12895 (2022)

    Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., Geiger, A.: Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence 45(11), 12878–12895 (2022)

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  8. [8]

    Contributors, O.: Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene (2023)

  9. [9]

    DriveFine: Refining-augmented masked diffusion VLA for precise and robust driving

    Dang, C., Ang, S., Li, Y., Tian, H., Wang, J., Li, G., Ye, H., Ma, J., Chen, L., Wang, Y.: Drivefine: Refining-augmented masked diffusion vla for precise and robust driving. arXiv preprint arXiv:2602.14577 (2026)

  10. [10]

    Advances in Neural Information Processing Systems 37, 28706–28719 (2024)

    Dauner, D., Hallgarten, M., Li, T., Weng, X., Huang, Z., Yang, Z., Li, H., Gilitschenski, I., Ivanovic, B., Pavone, M., et al.: Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems 37, 28706–28719 (2024)

  11. [11]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C.R., Zhou, Y., et al.: Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9710–9719 (2021)

  12. [12]

    arXiv preprint arXiv:2509.15968 (2025)

    Fang, S., Cui, Y., Liang, H., Lv, C., Hang, P., Sun, J.: Corevla: A dual-stage end-to-end autonomous driving framework for long-tail scenarios via collect-and-refine. arXiv preprint arXiv:2509.15968 (2025)

  13. [13]

    Rap: 3d rasterization augmented end-to-end planning. arXiv preprint arXiv:2510.04333, 2025

    Feng, L., Gao, Y., Zablocki, E., Li, Q., Li, W., Liu, S., Cord, M., Alahi, A.: Rap: 3d rasterization augmented end-to-end planning. arXiv preprint arXiv:2510.04333 (2025)

  14. [14]

    IEEE Robotics and Automation Letters 11(1), 226–233 (2025)

    Feng, R., Xi, N., Chu, D., Wang, R., Deng, Z., Wang, A., Lu, L., Wang, J., Huang, Y.: Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. IEEE Robotics and Automation Letters 11(1), 226–233 (2025)

  15. [15]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755, 2025

    Fu, H., Zhang, D., Zhao, Z., Cui, J., Liang, D., Zhang, C., Zhang, D., Xie, H., Wang, B., Bai, X.: Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755 (2025)

  16. [16]

    Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning,

    Fu, H., Zhang, D., Zhao, Z., Cui, J., Xie, H., Wang, B., Chen, G., Liang, D., Bai, X.: Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning. arXiv Preprint arXiv:2512.13636 (2025)

  17. [17]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

    Gao, H., Chen, S., Jiang, B., Liao, B., Shi, Y., Guo, X., Pu, Y., haoran yin, Li, X., xinbang zhang, ying zhang, Liu, W., Zhang, Q., Wang, X.: RAD: Training an end-to-end driving policy via large-scale 3DGS-based reinforcement learning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  18. [18]

    Gemini Team, Google DeepMind: Gemini 3: A new era of intelligence with gemini

  19. [19]

    Technical Report (2025), https://deepmind.google/models/gemini/

  20. [20]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  21. [21]

    ICLR 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  22. [22]

    Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025

    Hu, T., Liu, X., Wang, S., Zhu, Y., Liang, A., Kong, L., Zhao, G., Gong, Z., Cen, J., Huang, Z., et al.: Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760 (2025)

  23. [23]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17853–17862 (2023)

  24. [24]

    Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

    Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Covington, P., Sapp, B., et al.: Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262 (2024)

  25. [25]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., et al.: pi0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759 (2025)

  26. [26]

    OpenAI o1 System Card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

  27. [27]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Jia, X., Wu, P., Chen, L., Xie, J., He, C., Yan, J., Li, H.: Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21983–21994 (2023)

  28. [28]

    Jia, X., Yang, Z., Li, Q., Zhang, Z., Yan, J.: Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877 (2024)

  29. [29]

    IRL-VLA: Training an vision-language-action policy via reward world model,

    Jiang, A., Gao, Y., Wang, Y., Sun, Z., Wang, S., Heng, Y., Sun, H., Tang, S., Zhu, L., Chai, J., et al.: Irl-vla: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571 (2025)

  30. [30]

    Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

    Jiang, B., Chen, S., Liao, B., Zhang, X., Yin, W., Zhang, Q., Huang, C., Liu, W., Wang, X.: Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313 (2024)

  31. [31]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: Vad: Vectorized scene representation for efficient autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8340–8350 (2023)

  32. [32]

    Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608, 2025

    Jiang, B., Chen, S., Zhang, Q., Liu, W., Wang, X.: Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608 (2025)

  33. [33]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Karnchanachari, N., Geromichalos, D., Tan, K.S., Li, N., Eriksen, C., Yaghoubi, S., Mehdipour, N., Bernasconi, G., Fong, W.K., Guo, Y., et al.: Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 629–636. IEEE (2024)

  34. [34]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645

  35. [35]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  36. [36]

    Driving on registers. arXiv preprint arXiv:2601.05083, 2026

    Kirby, E., Boulch, A., Xu, Y., Yin, Y., Puy, G., Zablocki, É., Bursuc, A., Gidaris, S., Marlet, R., Bartoccioni, F., et al.: Driving on registers. arXiv preprint arXiv:2601.05083 (2026)

  37. [37]

    IEEE Robotics and Automation Letters 11(1), 818–825 (2025)

    Lei, M., Zhou, Z., Li, H., Ma, J., Hu, J.: Risk map as middleware: Toward interpretable cooperative end-to-end autonomous driving for risk-aware planning. IEEE Robotics and Automation Letters 11(1), 818–825 (2025)

  38. [38]

    Finetuning generative trajectory model with reinforcement learning from human feedback. arXiv preprint arXiv:2503.10434, 2025

    Li, D., Ren, J., Wang, Y., Wen, X., Li, P., Xu, L., Zhan, K., Xia, Z., Jia, P., Lang, X., et al.: Finetuning generative trajectory model with reinforcement learning from human feedback. arXiv preprint arXiv:2503.10434 (2025)

  39. [39]

    Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025

    Li, Y., Shang, S., Liu, W., Zhan, B., Wang, H., Wang, Y., Chen, Y., Wang, X., An, Y., Tang, C., et al.: Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796 (2025)

  40. [40]

    Li, Y., Wang, Y., Liu, Y., He, J., Fan, L., Zhang, Z.: End-to-end driving with online trajectory evaluation via bev world model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27137–27146 (2025)

  41. [41]

    Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025

    Li, Y., Xiong, K., Guo, X., Li, F., Yan, S., Xu, G., Zhou, L., Chen, L., Sun, H., Wang, B., et al.: Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052 (2025)

  42. [42]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Li, Z., Li, K., Wang, S., Lan, S., Yu, Z., Ji, Y., Li, Z., Zhu, Z., Kautz, J., Wu, Z., et al.: Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978 (2024)

  43. [43]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. arXiv preprint arXiv:2411.15139, 2024

    Liao, B., Chen, S., Yin, H., Jiang, B., Wang, C., Yan, S., Zhang, X., Li, X., Zhang, Y., Zhang, Q., et al.: Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. arXiv preprint arXiv:2411.15139 (2024)

  44. [44]

    Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting. arXiv preprint arXiv:2503.07234, 2025

    Liao, H., Kong, H., Wang, B., Wang, C., Ye, W., He, Z., Xu, C., Li, Z.: Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting. arXiv preprint arXiv:2503.07234 (2025)

  45. [45]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  46. [46]

    IEEE Robotics and Automation Letters 11(2), 1738–1745 (2025)

    Liu, D., Gao, Y., Qian, D., Zhang, Q., Ye, X., Han, J., Zheng, Y., Liu, X., Xia, Z., Ding, D., et al.: Takead: Preference-based post-optimization for end-to-end autonomous driving with expert takeover data. IEEE Robotics and Automation Letters 11(2), 1738–1745 (2025)

  47. [47]

    DriveWorld-VLA: Unified latent-space world modeling with vision-language-action for autonomous driving. arXiv preprint arXiv:2602.06521, 2026

    Liu, L., Song, Z., Jia, C., Ye, H., Hao, X., Chen, L., et al.: Driveworld-vla: Unified latent-space world modeling with vision-language-action for autonomous driving. arXiv preprint arXiv:2602.06521 (2026)

  48. [48]

    CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

    Liu, P., Ning, Q., Lu, X., Liu, H., Ma, W., She, D., Jia, P., Lang, X., Ma, J.: Omnireason: A temporal-guided vision-language-action framework for autonomous driving. arXiv preprint arXiv:2509.00789 (2025)

  49. [49]

    arXiv preprint arXiv:2510.00154 (2025)

    Liu, X., Sani, M.F., Zhou, Z., Wirbel, J., Zarrin, B., Galeazzi, R.: Robopilot: Generalizable dynamic robotic manipulation with dual-thinking modes. arXiv preprint arXiv:2510.00154 (2025)

  50. [50]

    Ma, Y., Cao, Y., Ding, W., Zhang, S., Wang, Y., Ivanovic, B., Jiang, M., Pavone, M., Xiao, C.: dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning (2025)

  51. [51]

    Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023

    Mao, J., Qian, Y., Ye, J., Zhao, H., Wang, Y.: Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415 (2023)

  52. [52]

    Waymo Research: 2025 waymo open dataset challenge: Vision-based end-to-end driving. https://waymo.com/open/challenges/2025/e2e-driving/ (2025), accessed: 2025-04-25

  53. [53]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Pan, C., Yaman, B., Nesti, T., Mallik, A., Allievi, A.G., Velipasalar, S., Ren, L.: Vlp: Vision language planning for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14760–14769 (2024)

  54. [54]

    Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. arXiv preprint arXiv:2503.12772, 2025

    Park, S.Y., Cui, C., Ma, Y., Moradipari, A., Gupta, R., Han, K., Wang, Z.: Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. arXiv preprint arXiv:2503.12772 (2025)

  55. [55]

    Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2512.24426, 2025

    Peng, Z., Ding, W., You, Y., Chen, Y., Luo, W., Tian, T., Cao, Y., Sharma, A., Xu, D., Ivanovic, B., et al.: Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2512.24426 (2025)

  56. [56]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Qian, T., Chen, J., Zhuo, L., Jiao, Y., Jiang, Y.G.: Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4542–4550 (2024)

  57. [57]

    Advances in Neural Information Processing Systems 36, 53728–53741 (2023)

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)

  58. [58]

    Nord: A data-efficient vision-language-action model that drives without reasoning. arXiv preprint arXiv:2602.21172, 2026

    Rawal, I., Gupta, S., Hu, Y., Zhan, W.: Nord: A data-efficient vision-language-action model that drives without reasoning. arXiv preprint arXiv:2602.21172 (2026)

  59. [59]

    arXiv preprint arXiv:2506.11234 (2025)

    Rowe, L., de Schaetzen, R., Girgis, R., Pal, C., Paull, L.: Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234 (2025)

  60. [60]

    Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. arXiv preprint arXiv:2509.17940, 2025

    Shang, S., Chen, Y., Wang, Y., Li, Y., Zhang, Z.: Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. arXiv preprint arXiv:2509.17940 (2025)

  61. [61]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  62. [62]

    In: European Conference on Computer Vision

    Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European Conference on Computer Vision. pp. 256–274. Springer (2024)

  63. [63]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Song, R., Liang, C., Cao, H., Yan, Z., Zimmer, W., Gross, M., Festag, A., Knoll, A.: Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17996–18006 (2024)

  64. [64]

    Latent Chain-of-Thought World Modeling for End-to-End Driving

    Tan, S., Chitta, K., Chen, Y., Tian, R., You, Y., Wang, Y., Luo, W., Cao, Y., Krahenbuhl, P., Pavone, M., et al.: Latent chain-of-thought world modeling for end-to-end driving. arXiv preprint arXiv:2512.10226 (2025)

  65. [65]

    arXiv preprint arXiv:2510.11083 (2025)

    Tan, T., Zheng, Y., Liang, R., Wang, Z., Zheng, K., Zheng, J., Li, J., Zhan, X., Liu, J.: Flow matching-based autonomous driving planning with advanced interactive behavior modeling. arXiv preprint arXiv:2510.11083 (2025)

  66. [66]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024)

  67. [67]

    Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651, 2024

    Wang, R., Li, H., Han, X., Zhang, Y., Baldwin, T.: Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651 (2024)

  68. [68]

    Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533, 2024

    Wang, S., Yu, Z., Jiang, X., Lan, S., Shi, M., Chang, N., Kautz, J., Li, Y., Alvarez, J.M.: Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533 (2024)

  69. [69]

    Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025

    Wang, Y., Luo, W., Bai, J., Cao, Y., Che, T., Chen, K., Chen, Y., Diamond, J., Ding, Y., Ding, W., et al.: Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088 (2025)

  70. [70]

    Vq-vla: Improving vision-language- action models via scaling vector-quantized action tokenizers

    Wang, Y., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al.: Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372 (2025)

  71. [71]

    Latentvla: Efficient vision-language models for autonomous driving via latent action prediction

    Xie, C., Sun, B., Li, T., Wu, J., Hao, Z., Lang, X., Li, H.: Latentvla: Efficient vision-language models for autonomous driving via latent action prediction. arXiv preprint arXiv:2601.05611 (2026)

  72. [72]

    In: Proceedings of the Winter Conference on Applications of Computer Vision

    Xing, S., Qian, C., Wang, Y., Hua, H., Tian, K., Zhou, Y., Tu, Z.: Openemma: Open-source multimodal model for end-to-end autonomous driving. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 1001–1009 (2025)

  73. [73]

    Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

    Xu, R., Lin, H., Jeon, W., Feng, H., Zou, Y., Sun, L., Gorman, J., Tolstaya, E., Tang, S., White, B., et al.: Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125 (2025)

  74. [74]

    Vlm-ad: End-to-end autonomous driving through vision-language model supervision. arXiv preprint arXiv:2412.14446, 2024

    Xu, Y., Hu, Y., Zhang, Z., Meyer, G.P., Mustikovela, S.K., Srinivasa, S., Wolff, E.M., Huang, X.: Vlm-ad: End-to-end autonomous driving through vision-language model supervision. arXiv preprint arXiv:2412.14446 (2024)

  75. [75]

    arXiv preprint arXiv:2506.06659 (2025)

    Yao, W., Li, Z., Lan, S., Wang, Z., Sun, X., Alvarez, J.M., Wu, Z.: Drivesuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659 (2025)

  76. [76]

    A survey on efficient vision-language-action models, 2025

    Yu, Z., Wang, B., Zeng, P., Zhang, H., Zhang, J., Gao, L., Song, J., Sebe, N., Shen, H.T.: A survey on efficient vision-language-action models. arXiv preprint arXiv:2510.24795 (2025)

  77. [77]

    Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024

    Yuan, C., Zhang, Z., Sun, J., Sun, S., Huang, Z., Lee, C.D.W., Li, D., Han, Y., Wong, A., Tee, K.P., et al.: Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601 (2024)

  78. [78]

    Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685, 2025

    Zeng, S., Chang, X., Xie, M., Liu, X., Bai, Y., Pan, Z., Xu, M., Wei, X., Guo, N.: Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685 (2025)

  79. [79]

    arXiv preprint arXiv:2602.21952 (2026)

    Zhang, L., Yuan, Y., Wu, C., Chang, X., Cai, X., Zeng, S., Shi, L., Wang, S., Zhang, H., Xu, M.: Minddriver: Introducing progressive multimodal reasoning for autonomous driving. arXiv preprint arXiv:2602.21952 (2026)

  80. [80]

    BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

    Zhao, S.Z., Wang, L., Ruan, H., Bao, Y., Chen, Y., Leng, Z., Ravichandran, A., He, H., Zhou, Z., Han, X., et al.: Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving. arXiv preprint arXiv:2604.10856 (2026)

Showing first 80 references.