AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Pith reviewed 2026-05-16 20:03 UTC · model grok-4.3
The pith
Reinforcement learning with tailored rewards and a two-stage strategy improves vision-language models for autonomous driving planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AlphaDrive integrates four GRPO-based RL rewards designed for planning into vision-language models and trains them with a two-stage strategy of supervised fine-tuning followed by RL; the resulting models achieve higher planning performance and efficiency than SFT-only baselines and exhibit emergent multimodal planning capabilities after RL training.
What carries the argument
Four GRPO-based RL rewards tailored for planning together with a two-stage SFT-plus-RL training strategy that elicits reasoning in VLMs.
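The group-relative part of GRPO can be made concrete. A minimal sketch, not the paper's implementation: each prompt gets a group of sampled plans, and each plan's advantage is its reward normalized against the group's mean and standard deviation, with no learned value function.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its
    sampling group's mean and std (no learned critic)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, G = 4 sampled plans; the scalar rewards are hypothetical
# combinations of the paper's four planning rewards into one number.
rewards = [0.9, 0.4, 0.4, 0.1]
adv = group_relative_advantages(rewards)
assert adv[0] > 0 > adv[-1]      # better-than-group plans get positive advantage
assert abs(adv.sum()) < 1e-6     # mean-centered by construction
```

The appeal for planning is that the rewards can be computed directly from the emitted plan, so no separate reward model or value network needs to be trained.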
If this is right
- Planning accuracy rises compared with models trained only by supervised fine-tuning.
- Training converges faster when reasoning is included via RL.
- Emergent multimodal planning appears after the RL stage and supports safer driving choices.
- The same reward and training design can be applied to other VLM-based planning tasks.
- End-to-end driving systems gain common-sense capabilities without additional labeled data.
Where Pith is reading between the lines
- The approach may reduce reliance on large human-annotated driving datasets by letting RL discover useful reasoning patterns.
- Similar reward structures could transfer to other sequential decision domains such as robotics or logistics planning.
- Emergent multimodal behavior suggests that RL on VLMs can surface capabilities not explicitly optimized for in the rewards.
- Real-world deployment would still require additional safety layers because the paper evaluates primarily on simulation or controlled data.
Load-bearing premise
The chosen RL rewards and two-stage schedule produce planning improvements that generalize to unseen real-world driving conditions rather than fitting only the training distribution.
What would settle it
Run the trained model on a fresh collection of real-world driving sequences containing long-tailed events and check whether safety-critical planning metrics degrade relative to the SFT baseline.
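The comparison such a test would run can be sketched in a few lines. The metric names (`collision_rate`, `off_road_rate`) and the lower-is-better convention are assumptions for illustration, not metrics reported in the paper.

```python
def safety_regression(rl_metrics, sft_metrics,
                      critical=("collision_rate", "off_road_rate")):
    """Return the safety-critical metrics (lower is better) on which the
    RL-trained model is worse than the SFT baseline, with both values."""
    return {
        k: (rl_metrics[k], sft_metrics[k])
        for k in critical
        if rl_metrics[k] > sft_metrics[k]
    }

# Hypothetical results on a held-out long-tail split.
rl = {"collision_rate": 0.012, "off_road_rate": 0.030}
sft = {"collision_rate": 0.015, "off_road_rate": 0.025}
worse = safety_regression(rl, sft)
# Here RL regresses only on off-road rate, flagging a targeted failure mode.
```

Reporting the per-metric pairs, rather than a single aggregate, is what makes the check decisive: a net gain can hide a regression on exactly the long-tailed events the premise is about.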
Original abstract
OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains like mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tailed problems due to limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data, without further exploration of training strategies or optimizations specifically tailored for planning. In this paper, we propose AlphaDrive, a RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored for planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning. Moreover, we are also excited to discover that, following RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which is critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first to integrate GRPO-based RL with planning reasoning into autonomous driving. Code will be released to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AlphaDrive, a framework that applies GRPO-based reinforcement learning and reasoning to vision-language models for autonomous driving planning. It introduces four planning-specific RL rewards and a two-stage SFT-then-RL training pipeline, claiming measurable gains in planning performance and efficiency over SFT-only baselines, plus emergent multimodal planning behaviors after RL. The work is positioned as the first integration of GRPO RL with planning reasoning in this domain, supported by ablations, held-out test sets, OOD scenarios, and safety proxies.
Significance. If the reported quantitative deltas and controls hold, the result would provide concrete evidence that RL-augmented reasoning can mitigate long-tailed and common-sense limitations in end-to-end driving VLMs. The combination of performance gains, training efficiency improvements, and emergent capabilities on held-out and OOD data would be relevant for safety-critical applications; code release further strengthens reproducibility.
Minor comments (3)
- Abstract: the claim of 'significant' improvements would be stronger if one or two key quantitative metrics (e.g., planning success rate delta or efficiency gain) were included to allow readers to gauge magnitude without reading the full experiments section.
- §4.3 (or equivalent ablation subsection): the description of the four GRPO rewards would benefit from an explicit equation or pseudocode block showing how each reward is computed from the planning output, even if the prose already defines them.
- Figure 5 (emergent capabilities examples): adding a short caption or legend clarifying which behaviors are newly emergent versus present in the SFT baseline would improve clarity for readers comparing the two stages.
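The second comment asks for explicit reward computation. As one illustration of what such pseudocode could look like, here is a hedged sketch: the four reward names follow the abstract's high-level description, but every formula below is an assumption, not the paper's definition.

```python
import re

def planning_rewards(output, reference, group_outputs, action_weights):
    """Hypothetical sketch of four GRPO-style planning rewards:
    - accuracy: predicted meta-actions match the reference plan
    - action_weighted: accuracy reweighted by per-action importance
    - diversity: penalize duplicate plans within the sampling group
    - format: reward well-formed <think>...</think><answer>...</answer> output
    All formulas are illustrative assumptions."""
    pred, ref = set(output["actions"]), set(reference["actions"])
    accuracy = len(pred & ref) / max(len(ref), 1)
    weighted = (sum(action_weights.get(a, 1.0) for a in pred & ref)
                / max(sum(action_weights.get(a, 1.0) for a in ref), 1e-8))
    dupes = sum(o["actions"] == output["actions"] for o in group_outputs)
    diversity = 1.0 / dupes  # 1.0 if this plan is unique within its group
    fmt = 1.0 if re.fullmatch(r"<think>.*</think>\s*<answer>.*</answer>",
                              output["text"], re.S) else 0.0
    return {"accuracy": accuracy, "action_weighted": weighted,
            "diversity": diversity, "format": fmt}

out = {"actions": ["decelerate", "keep_lane"],
       "text": "<think>slow car ahead</think><answer>decelerate, keep_lane</answer>"}
ref = {"actions": ["decelerate", "keep_lane"]}
rewards = planning_rewards(out, ref, [out], {"decelerate": 2.0, "keep_lane": 1.0})
# All four components equal 1.0 for this fully correct, well-formed plan.
```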
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work and the recommendation for minor revision. We are pleased that the significance of integrating GRPO-based RL with planning reasoning in VLMs for autonomous driving is recognized, along with the reported gains in performance, efficiency, and emergent capabilities. Since no specific major comments were listed in the report, we have no point-by-point responses to provide.
Circularity Check
No significant circularity detected
Full rationale
The paper applies established GRPO RL (from prior external work) via four planning-specific rewards and a two-stage SFT+RL pipeline to VLMs in driving. No derivation, equation, or central claim reduces to a self-defined quantity, fitted parameter renamed as prediction, or self-citation chain. Experimental gains are shown via ablations on held-out/OOD sets rather than by construction. The framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Four GRPO-based RL rewards
Axioms (1)
- Domain assumption: VLMs possess sufficient base capabilities to benefit from GRPO RL for planning tasks
Forward citations
Cited by 18 Pith papers
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
  Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
- Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
  PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
- MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
  MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
- FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving
  FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  Multi-ORFT improves closed-loop multi-agent driving planners by coupling scene-consistent diffusion pre-training with stable online RL post-training, reducing collisions and off-road rates while increasing speed on th...
- Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
  LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.
- How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
  VENUSS benchmark shows top VLMs achieve 57% accuracy on sequential driving scenes, strong on static objects but weak on vehicle dynamics and temporal relations.
- Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving
  Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
  DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
  AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
- Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
  CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
  RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
- DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
  DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2Drive benchmark.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...