
arxiv: 2509.01944 · v3 · submitted 2025-09-02 · 💻 cs.RO · cs.CV

AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

Pith reviewed 2026-05-18 20:16 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords autonomous driving · vision language action models · chain of thought · self reflection · reinforcement learning · trajectory planning · nuScenes · Waymo

The pith

A VLA framework for autonomous driving gains better reasoning by using chain-of-thought processing and self-reflection during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve the decision-making process in vision-language-action models for self-driving cars by making it more interpretable and coherent. It does this by creating a dataset with a structured four-step reasoning chain that includes self-reflection to link what the car sees to how it should move. Reinforcement learning is then applied with a special optimization method and rewards based on physical rules to encourage safe and smooth paths. If this works, it could lead to autonomous vehicles that not only drive better but also explain their choices in a logical way. A sympathetic reader would care because current systems often act without clear reasoning, which raises safety concerns in real traffic.

Core claim

The central claim is that a two-stage recipe enables VLA models to achieve state-of-the-art performance and robust generalization on both nuScenes and Waymo. The first stage is supervised fine-tuning on the nuScenesR²-6K dataset, which uses a four-step logical chain with self-reflection for validation. The second stage is optimization with the Group Relative Policy Optimization (GRPO) algorithm in a physics-grounded reward framework that includes spatial alignment, vehicle dynamics, and temporal smoothness.
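To make the three reward criteria concrete, here is a minimal sketch of what a physics-grounded trajectory reward could look like. It is not the paper's formulation: the function names, the exponential mappings, the acceleration limit, the 0.5 s waypoint spacing, and the equal default weights are all illustrative assumptions.

```python
import math

def finite_diff(points, dt):
    """First-order finite difference of a list of 2D points."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def spatial_reward(pred, ref):
    """Assumed form: exp(-mean waypoint distance), so a perfect match scores 1."""
    dists = [math.dist(p, r) for p, r in zip(pred, ref)]
    return math.exp(-sum(dists) / len(dists))

def dynamics_reward(pred, dt, max_accel=4.0):
    """Assumed form: fraction of steps whose acceleration stays under a limit (m/s^2)."""
    acc = finite_diff(finite_diff(pred, dt), dt)
    if not acc:
        return 1.0
    return sum(1 for a in acc if math.hypot(*a) <= max_accel) / len(acc)

def smoothness_reward(pred, dt):
    """Assumed form: exp(-mean jerk magnitude), so zero jerk scores 1."""
    jerk = finite_diff(finite_diff(finite_diff(pred, dt), dt), dt)
    if not jerk:
        return 1.0
    return math.exp(-sum(math.hypot(*j) for j in jerk) / len(jerk))

def physics_reward(pred, ref, dt=0.5, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical free parameters."""
    w_s, w_d, w_t = weights
    return (w_s * spatial_reward(pred, ref)
            + w_d * dynamics_reward(pred, dt)
            + w_t * smoothness_reward(pred, dt))

# A constant-velocity path that matches the reference earns the maximum of each term.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
jerky = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
print(physics_reward(straight, straight), physics_reward(jerky, straight))
```

Under these assumptions each term lies in (0, 1], so the weights directly set the relative importance of accuracy, feasibility, and comfort.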

What carries the argument

The four-step logical chain with self-reflection in the nuScenesR²-6K dataset combined with Group Relative Policy Optimization under physics-grounded rewards for trajectory planning.
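For readers unfamiliar with GRPO, its distinguishing step is that each sampled response is scored relative to the other responses drawn for the same prompt, which removes the need for a learned value model. The sketch below shows only that normalization step. The function name is hypothetical, and the use of the population standard deviation with a small epsilon is an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: scalar rewards for G responses sampled from the same prompt.
    eps: assumed stabilizer that avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three candidate trajectories for one driving scene, scored by some reward function.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses with above-average reward in their group receive a positive advantage and are reinforced; a group in which every response ties yields all-zero advantages and contributes no gradient signal.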

Load-bearing premise

The four-step logical chain with self-reflection builds real cognitive connections between perception and safe driving actions, while the physics rewards accurately reflect what makes trajectories feasible and safe in practice.

What would settle it

Running the trained model on a set of edge-case scenarios involving potential hazards and checking if it generates fewer invalid or unsafe trajectories than versions without the self-reflection step or the specific rewards.
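One way such a check could be scored is sketched below: count the share of predicted trajectories whose implied acceleration exceeds a limit, then compare that rate across model variants. The 6 m/s² limit, the 0.5 s waypoint spacing, and the function names are assumptions for illustration, not values from the paper.

```python
import math

def max_acceleration(traj, dt):
    """Largest acceleration magnitude implied by consecutive 2D waypoints."""
    vel = [((x1 - x0) / dt, (y1 - y0) / dt)
           for (x0, y0), (x1, y1) in zip(traj, traj[1:])]
    acc = [((vx1 - vx0) / dt, (vy1 - vy0) / dt)
           for (vx0, vy0), (vx1, vy1) in zip(vel, vel[1:])]
    return max((math.hypot(ax, ay) for ax, ay in acc), default=0.0)

def infeasible_rate(trajectories, dt=0.5, accel_limit=6.0):
    """Fraction of trajectories that break the assumed acceleration limit."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    bad = sum(1 for t in trajectories if max_acceleration(t, dt) > accel_limit)
    return bad / len(trajectories)

smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
jumpy = [(0.0, 0.0), (1.0, 0.0), (6.0, 0.0), (7.0, 0.0)]
print(infeasible_rate([smooth, jumpy]))
```

Running the same function on the outputs of the full model, a variant without the self-reflection step, and a variant without the physics rewards would give directly comparable rates on a shared scenario set.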

Figures

Figures reproduced from arXiv: 2509.01944 by Chengxuan Qian, Dapeng Zhang, Jing Tang, Lei Sun, Rui Chen, Shuo Li, Xiangxiang Chu, Yujun Cai, Zhenlong Yuan, Zijian Song.

Figure 1: AutoDrive-R² can effectively achieve planning trajectories across multiple benchmarks compared with other models. Given …
Figure 2: Pipeline of AutoDrive-R². We adopt a two-stage training process. The first stage introduces an innovative CoT dataset named …
Figure 3: Qualitative comparison of trajectory planning performance across Qwen2.5-VL-7B, EMMA+, and our AutoDrive-R² on the …
Figure 4: Qualitative comparison of trajectory planning performance across Qwen2.5-VL-7B, EMMA+, and our AutoDrive-R² on the …
Figure 5: Visualization comparison between Qwen2.5-VL-7B and our AutoDrive-R² on the nuScenes dataset.
Figure 6: Visualization comparison between Qwen2.5-VL-7B and our AutoDrive-R² on the nuScenes dataset.
Figure 7: "Aha Moment" of our AutoDrive-R² on the nuScenes dataset.

Abstract

Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R², a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR²-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both nuScenes and Waymo datasets demonstrate the state-of-the-art performance and robust generalization capacity of our proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AutoDrive-R², a Vision-Language-Action (VLA) framework for autonomous driving that integrates chain-of-thought (CoT) processing with self-reflection via a new nuScenesR²-6K dataset for supervised fine-tuning (using a four-step logical chain) and Group Relative Policy Optimization (GRPO) reinforcement learning with a physics-grounded reward model incorporating spatial alignment, vehicle dynamics, and temporal smoothness criteria. It reports state-of-the-art performance and robust generalization on nuScenes and Waymo datasets.

Significance. If the central claims hold after proper verification, the work could meaningfully advance VLA models for autonomous driving by explicitly incentivizing interpretable reasoning and self-reflection, potentially leading to more coherent and safer trajectory outputs than standard end-to-end approaches. The combination of a specialized CoT dataset and physics-informed rewards is a promising direction for addressing plausibility issues in decision-making.

major comments (2)
  1. [Evaluation] Evaluation section: The manuscript asserts state-of-the-art performance and robust generalization across nuScenes and Waymo, yet the abstract and evaluation description provide no quantitative tables, ablation studies isolating the contribution of the four-step CoT self-reflection step, closed-loop simulation results, or direct comparisons against recent VLA baselines trained on equivalent data volumes. Without these, it is impossible to establish that the reported gains arise from the proposed reasoning mechanisms rather than dataset scale or base model capacity.
  2. [Reward Framework] Reward framework (physics-grounded terms): The reward function combines spatial alignment, vehicle dynamics, and temporal smoothness, but the weights for these terms are not shown to have been derived independently of the nuScenes/Waymo evaluation splits. If any tuning occurred on the same data used for final reporting, the performance numbers risk partial circularity, weakening the claim that the rewards ensure reliable real-world safety and feasibility.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'Extensive evaluation results' is used without any accompanying metrics, baseline names, or improvement deltas, which reduces immediate readability and makes it harder for readers to gauge the scale of the claimed advances.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We have addressed each of the major comments in detail below and revised the manuscript to include the requested clarifications, additional experiments, and explanations to strengthen the presentation of our results.

  1. Referee: [Evaluation] Evaluation section: The manuscript asserts state-of-the-art performance and robust generalization across nuScenes and Waymo, yet the abstract and evaluation description provide no quantitative tables, ablation studies isolating the contribution of the four-step CoT self-reflection step, closed-loop simulation results, or direct comparisons against recent VLA baselines trained on equivalent data volumes. Without these, it is impossible to establish that the reported gains arise from the proposed reasoning mechanisms rather than dataset scale or base model capacity.

    Authors: We agree that these additional elements are necessary to rigorously substantiate the claims. In the revised manuscript, we have incorporated quantitative tables with direct comparisons against recent VLA baselines trained on comparable data volumes. We have added ablation studies that isolate the contribution of the four-step CoT self-reflection mechanism. We also include closed-loop simulation results on both nuScenes and Waymo to demonstrate practical generalization and safety. These changes clarify that the observed gains derive from the proposed reasoning and self-reflection components rather than data scale or base model alone. revision: yes

  2. Referee: [Reward Framework] Reward framework (physics-grounded terms): The reward function combines spatial alignment, vehicle dynamics, and temporal smoothness, but the weights for these terms are not shown to have been derived independently of the nuScenes/Waymo evaluation splits. If any tuning occurred on the same data used for final reporting, the performance numbers risk partial circularity, weakening the claim that the rewards ensure reliable real-world safety and feasibility.

    Authors: We thank the referee for highlighting this potential issue of circularity. The reward weights were determined using physical principles and a separate held-out validation subset drawn from the training data, with no overlap to the final nuScenes or Waymo evaluation splits. In the revised manuscript, we have expanded the reward framework section to explicitly describe this independent tuning procedure, including the validation protocol used. This addition removes any ambiguity and supports the reliability of the physics-grounded rewards for real-world safety and feasibility. revision: yes
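The independence the rebuttal describes can be checked mechanically. The sketch below carves a validation subset out of training data at the scene level, so no driving scene contributes frames to both sides, and then verifies disjointness. The `scene_id` field, the function names, and the 20% fraction are hypothetical; the paper's actual protocol is not given in the text above.

```python
import random

def split_by_scene(samples, val_fraction=0.2, seed=0):
    """Hold out whole scenes so tuning data never shares a scene with training data."""
    scenes = sorted({s["scene_id"] for s in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(scenes)
    n_val = max(1, round(len(scenes) * val_fraction))
    val_scenes = set(scenes[:n_val])
    train = [s for s in samples if s["scene_id"] not in val_scenes]
    val = [s for s in samples if s["scene_id"] in val_scenes]
    return train, val

def assert_disjoint(split_a, split_b):
    """Raise if any scene appears in both splits."""
    overlap = {s["scene_id"] for s in split_a} & {s["scene_id"] for s in split_b}
    if overlap:
        raise ValueError(f"scenes shared between splits: {sorted(overlap)}")

# Toy data: 10 scenes with 4 frames each.
samples = [{"scene_id": f"scene-{i // 4}", "frame": i} for i in range(40)]
train, val = split_by_scene(samples)
assert_disjoint(train, val)
print(len(train), len(val))
```

The same `assert_disjoint` check can be run between the tuning subset and each evaluation split, which is the property the referee asked the authors to demonstrate.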

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external benchmarks


The paper constructs a custom CoT dataset (nuScenesR²-6K) via a four-step logical chain, applies GRPO under explicitly physics-grounded reward terms (spatial alignment, vehicle dynamics, temporal smoothness), and reports open-loop metrics on nuScenes and Waymo. None of these steps reduce by construction to the reported performance numbers: the reward criteria are presented as independent physical priors rather than fitted parameters, the evaluation splits are standard public benchmarks, and no load-bearing self-citation or uniqueness theorem is invoked to force the outcome. The central claims therefore rest on empirical comparison rather than definitional equivalence or statistical forcing.
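As an illustration of what an open-loop metric measures, the sketch below computes the L2 displacement between a predicted and a logged trajectory at fixed horizons. The 0.5 s waypoint spacing, the 1 s, 2 s, and 3 s horizons, and the at-horizon (rather than averaged) convention are assumptions; published planners differ on these conventions, and the text above does not say which one the paper uses.

```python
import math

def l2_error_at_horizons(pred, logged, dt=0.5, horizons=(1.0, 2.0, 3.0)):
    """L2 distance between predicted and logged waypoints at each horizon.

    Waypoint k (0-based) is assumed to lie at time (k + 1) * dt seconds.
    """
    errors = {}
    for h in horizons:
        idx = round(h / dt) - 1
        errors[h] = math.dist(pred[idx], logged[idx])
    return errors

# Six waypoints covering 3 s; the prediction is offset laterally by 0.5 m.
logged = [(float(i), 0.0) for i in range(1, 7)]
pred = [(float(i), 0.5) for i in range(1, 7)]
print(l2_error_at_horizons(pred, logged))
```

Because the logged trajectory never reacts to the model's output, a metric of this kind measures imitation accuracy rather than closed-loop driving quality, which is why the referee asks for closed-loop results separately.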

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the untested premise that the constructed CoT traces and the chosen reward terms are sufficient to induce genuine reasoning rather than surface-level pattern matching. No free parameters are explicitly listed in the abstract, but reward weights and the exact four-step template are implicit fitted or chosen elements.

free parameters (1)
  • reward weights for spatial alignment, dynamics, and smoothness
    These scalars balance the three physics criteria and are required for the RL stage; their values are not derived from first principles.
axioms (1)
  • domain assumption: The four-step logical chain with self-reflection produces trajectories that are more coherent and plausible than direct mapping from perception to action.
    Invoked in the description of the nuScenesR²-6K dataset construction.
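
To see why the reward weights count as a free parameter, it helps to write the combined reward out. The linear form and every symbol below are assumptions made for illustration; the abstract names the three criteria but does not state how they are combined.

```latex
% Assumed linear combination for a trajectory \tau:
%   R_s = spatial alignment, R_d = vehicle dynamics, R_t = temporal smoothness.
R(\tau) = w_s \, R_s(\tau) + w_d \, R_d(\tau) + w_t \, R_t(\tau),
\qquad w_s,\, w_d,\, w_t \ge 0 .
```

In a GRPO setting with mean and scale normalization of rewards within each sampled group, rescaling all three weights by a common positive factor leaves the normalized advantages essentially unchanged, so under this assumed form only the ratios between the weights matter.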

pith-pipeline@v0.9.0 · 5768 in / 1367 out tokens · 48517 ms · 2026-05-18T20:16:59.271476+00:00 · methodology


Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via ...

  3. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  4. The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

  5. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  6. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  7. Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.

  8. Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    cs.RO 2025-10 conditional novelty 6.0

    Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in dr...

  9. IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

    cs.CV 2026-05 unverdicted novelty 5.0

    IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localizatio...

  10. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  11. Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

    cs.RO 2026-05 unverdicted novelty 5.0

    CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.

  12. Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

    cs.RO 2026-05 unverdicted novelty 5.0

    CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and PDMS 91.1 on Bench2Drive and NAVSIM.

  13. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  14. Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

    cs.AI 2025-10 unverdicted novelty 4.0

    A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...
