Recognition: 3 Lean theorem links
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Pith reviewed 2026-05-15 07:32 UTC · model grok-4.3
The pith
ReCogDrive combines a vision-language model for cognition with a reinforced diffusion planner to generate feasible, safe driving trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReCogDrive unifies understanding and planning by pairing an autoregressive vision-language model with a diffusion planner. Human driving cognition is transferred via a hierarchical pipeline of generation, refinement, and quality control. The model's priors are injected into the diffusion planner to produce stable continuous trajectories, and DiffGRPO reinforcement is applied to improve safety and comfort, resulting in state-of-the-art benchmark performance.
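As a rough illustration of that interface, here is a minimal sketch in NumPy; the module names, shapes, and denoising rule are invented stand-ins consistent with the paper's stated design, not its actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    def vlm_prior(scene_tokens: np.ndarray) -> np.ndarray:
        """Stand-in for the VLM: pool scene tokens into a driving-prior vector."""
        return scene_tokens.mean(axis=0)

    def denoise_step(traj: np.ndarray, prior: np.ndarray, t: float) -> np.ndarray:
        """Toy conditional denoiser: nudge the noisy trajectory toward a
        prior-dependent anchor, more strongly as t approaches 0."""
        anchor = np.tanh(prior[: traj.shape[1]])  # prior shapes the target
        return traj + (1.0 - t) * 0.1 * (anchor - traj)

    def plan(scene_tokens: np.ndarray, horizon: int = 8, steps: int = 20) -> np.ndarray:
        """Diffusion-style planning: start from noise and iteratively denoise,
        conditioned on the VLM prior rather than decoding actions as text."""
        prior = vlm_prior(scene_tokens)
        traj = rng.normal(size=(horizon, 2))  # (x, y) waypoints
        for k in range(steps, 0, -1):
            traj = denoise_step(traj, prior, t=k / steps)
        return traj

    waypoints = plan(rng.normal(size=(16, 32)))
    print(waypoints.shape)  # (8, 2): continuous waypoints, no language decoding

The point of the sketch is the claimed division of labor: the VLM supplies a conditioning signal, and all trajectory generation happens in continuous space.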
What carries the argument
The hierarchical data pipeline that embeds human driving cognition into the VLM, followed by injection of those priors into a diffusion planner reinforced by DiffGRPO for trajectory generation.
Load-bearing premise
The three-stage data pipeline transfers genuine human driving cognition into the model without embedding dataset-specific biases that would degrade performance in real driving conditions.
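A schematic of that three-stage control flow, with hypothetical stage functions and an illustrative quality threshold (the paper's actual generation models and filtering criteria are not reproduced here):

    from dataclasses import dataclass

    @dataclass
    class Sample:
        scene_id: str
        annotation: str
        quality: float

    def generate(scene_id: str) -> Sample:
        # Stage 1: draft a driving annotation for the scene (e.g. with a VLM).
        return Sample(scene_id, f"draft rationale for {scene_id}", quality=0.5)

    def refine(s: Sample) -> Sample:
        # Stage 2: rewrite the draft to better match human driving reasoning.
        return Sample(s.scene_id, s.annotation.replace("draft", "refined"),
                      quality=s.quality + 0.4)

    def passes_quality_control(s: Sample, threshold: float = 0.8) -> bool:
        # Stage 3: keep only samples whose quality score clears the threshold.
        return s.quality >= threshold

    def build_dataset(scene_ids):
        return [s for s in (refine(generate(sid)) for sid in scene_ids)
                if passes_quality_control(s)]

    print(len(build_dataset(["s001", "s002"])))  # 2: both clear the toy threshold

The premise above is about what this filter keeps: if quality control rewards benchmark-typical behavior, the surviving samples encode benchmark bias rather than cognition.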
What would settle it
Real-world closed-loop driving tests that measure collision rate and trajectory feasibility on routes not seen in the training data, compared against prior VLM-only planners.
Original abstract
Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReCogDrive, a framework for end-to-end autonomous driving that integrates an autoregressive VLM (instilled with human driving cognition via a three-stage hierarchical data pipeline of generation, refinement, and quality control) with a diffusion planner to resolve the language-action mismatch, further optimized via Diffusion Group Relative Policy Optimization (DiffGRPO) for safety and comfort. It claims SOTA performance on the NAVSIM and Bench2Drive benchmarks plus strong qualitative scene comprehension on DriveBench.
Significance. If the SOTA claims and generalization hold after proper validation, the work would meaningfully advance VLM-based driving by demonstrating a practical unification of cognitive priors with continuous trajectory generation, potentially improving handling of long-tail scenarios while maintaining real-time feasibility; the public release of code, weights, and data would further strengthen its impact.
major comments (3)
- [Abstract] Abstract and Experiments section: the SOTA claim on NAVSIM and Bench2Drive is stated without any quantitative baseline numbers, statistical significance tests, error bars, or ablation tables, leaving the central performance result impossible to assess from the provided information.
- [Method] Method section (hierarchical data pipeline): no ablation studies, distribution-shift metrics, or out-of-distribution tests are reported to verify that the generation/refinement/quality-control stages instill transferable cognition rather than benchmark-specific biases; if the pipeline sources overlap with NAVSIM/Bench2Drive simulators, gains may reflect distribution matching instead of genuine cognitive transfer.
- [Method] Method section (DiffGRPO): the reinforcement stage is introduced as a novel component with free hyperparameters, yet no comparison to standard policy-gradient or diffusion-specific RL baselines is supplied, nor is any sensitivity analysis given for those hyperparameters.
minor comments (2)
- [Abstract] Abstract: the acronym DiffGRPO is used before any expansion or definition, which reduces immediate readability.
- [Experiments] Qualitative results: the DriveBench examples would benefit from explicit failure-case analysis to substantiate the 'strong scene comprehension' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below and will revise the manuscript to incorporate additional quantitative details, ablations, and analyses where appropriate.
Point-by-point responses
- Referee: [Abstract] Abstract and Experiments section: the SOTA claim on NAVSIM and Bench2Drive is stated without any quantitative baseline numbers, statistical significance tests, error bars, or ablation tables, leaving the central performance result impossible to assess from the provided information.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. The full Experiments section contains baseline comparisons, but we will revise the abstract to report specific metrics (e.g., NAVSIM and Bench2Drive scores with improvements over baselines). We will also add error bars and statistical significance tests and ensure that ablation tables are prominently featured in the revised Experiments section. revision: yes
- Referee: [Method] Method section (hierarchical data pipeline): no ablation studies, distribution-shift metrics, or out-of-distribution tests are reported to verify that the generation/refinement/quality-control stages instill transferable cognition rather than benchmark-specific biases; if the pipeline sources overlap with the NAVSIM/Bench2Drive simulators, gains may reflect distribution matching instead of genuine cognitive transfer.
  Authors: We acknowledge the need for explicit verification of cognitive transfer. In revision we will add ablation studies that isolate each pipeline stage and quantify the resulting performance drops. We will clarify that the data sources include diverse real-world logs and synthetic scenarios beyond the benchmark simulators, and we will report distribution-shift metrics. Full OOD evaluation on entirely new simulators is noted as a limitation for future work. revision: partial
- Referee: [Method] Method section (DiffGRPO): the reinforcement stage is introduced as a novel component with free hyperparameters, yet no comparison to standard policy-gradient or diffusion-specific RL baselines is supplied, nor is any sensitivity analysis given for those hyperparameters.
  Authors: We agree that direct comparisons would better substantiate the contribution of DiffGRPO. In the revised manuscript we will include results against standard policy-gradient methods and other diffusion RL baselines, together with a sensitivity analysis of the key hyperparameters, all added to the Experiments section. revision: yes
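For context on the requested baselines, this is the group-relative advantage computation that GRPO-style methods share, in a generic sketch; the reward is a made-up comfort proxy, and DiffGRPO's actual reward terms and diffusion-specific update are not reproduced:

    import numpy as np

    rng = np.random.default_rng(0)

    def reward(traj: np.ndarray) -> float:
        """Placeholder reward: penalize jerky motion (a stand-in for the
        paper's safety and comfort terms)."""
        return -float(np.abs(np.diff(traj, n=2, axis=0)).sum())

    # Sample a group of G candidate plans for the same scene.
    G = 8
    group = [rng.normal(size=(8, 2)).cumsum(axis=0) for _ in range(G)]
    rewards = np.array([reward(t) for t in group])

    # Group-relative advantages: normalize within the group, no learned critic.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    print(adv.round(2))  # positive entries mark above-average plans to reinforce

A sensitivity analysis of the kind the referee asks for would then vary the group size, the reward weights, and the normalization constant.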
Circularity Check
No circularity detected in ReCogDrive's derivation or claims.
Full rationale
The paper's SOTA claims on NAVSIM and Bench2Drive rest on external benchmark evaluations rather than any quantities defined in terms of the method's own fitted parameters or self-referential derivations. The hierarchical data pipeline (generation, refinement, quality control) is presented as an input-generation process to instill cognition into the VLM, followed by integration with a diffusion planner and DiffGRPO reinforcement; none of these steps reduce by construction to the target performance metrics or to self-citations that bear the central load. No equations, uniqueness theorems, or ansatzes are invoked that collapse the claimed cognitive transfer or trajectory generation back to the inputs by definition. The derivation chain remains self-contained against the stated external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- DiffGRPO optimization hyperparameters
axioms (1)
- domain assumption: conditioning a diffusion planner on VLM-derived driving priors produces feasible and stable trajectories (a toy feasibility check follows this ledger).
invented entities (1)
- DiffGRPO (no independent evidence)
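That assumption is checkable in principle. A toy kinematic feasibility test over planned waypoints might look like this (the limits are illustrative, not NAVSIM or Bench2Drive thresholds):

    import numpy as np

    def is_feasible(traj: np.ndarray, dt: float = 0.5,
                    v_max: float = 20.0, a_max: float = 4.0) -> bool:
        """Sanity check on (T, 2) waypoints: bounded speed and acceleration.
        Illustrative limits only, not the benchmarks' actual thresholds."""
        vel = np.diff(traj, axis=0) / dt
        acc = np.diff(vel, axis=0) / dt
        return bool(np.linalg.norm(vel, axis=1).max() <= v_max
                    and np.linalg.norm(acc, axis=1).max() <= a_max)

    straight = np.stack([np.linspace(0, 20, 9), np.zeros(9)], axis=1)
    print(is_feasible(straight))  # True: gentle, constant-speed motion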
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous).
  Passage: "We propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi
  Tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous).
  Passage: "we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort."
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous).
  Passage: "Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
  VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching an 88.91 driving score on Bench2Drive.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.
- The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
  Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- DriveFuture: Future-Aware Latent World Models for Autonomous Driving
  DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- GSDrive: Reinforcing Driving Policies by Multi-mode Trajectory Probing with 3D Gaussian Splatting Environment
  GSDrive improves end-to-end driving policies through 3D Gaussian Splatting simulation and multi-mode trajectory probing that supplies dense, differentiable rewards for reinforcement learning.
- Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
  Creates the LTD dataset for open-ended traffic VQA and trains the UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
  OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  Multi-ORFT improves closed-loop multi-agent driving planners by coupling scene-consistent diffusion pre-training with stable online RL post-training, reducing collisions and off-road rates while increasing speed on th...
- Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
  TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
  ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
  DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
  CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.
- CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
  CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
  RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
- DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
  DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.