pith. machine review for the scientific record.

arxiv: 2506.08052 · v2 · submitted 2025-06-09 · 💻 cs.CV · cs.RO

Recognition: 3 theorem links · Lean Theorem

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Pith reviewed 2026-05-15 07:32 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords end-to-end autonomous driving · vision-language models · diffusion planner · reinforcement learning · trajectory planning · NAVSIM benchmark · Bench2Drive benchmark

The pith

ReCogDrive combines a vision-language model for cognition with a reinforced diffusion planner to generate feasible, safe driving trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix problems that arise when vision-language models treat driving as a language task, such as invalid action formats and slow responses. It first builds human-like driving knowledge into the model through a three-stage data process of generation, refinement, and quality control. The learned knowledge then guides a diffusion planner that produces continuous trajectories instead of text tokens. A further reinforcement step called DiffGRPO tunes the planner for fewer collisions and greater comfort. Tests on NAVSIM and Bench2Drive show top scores and clearer scene understanding across varied conditions.
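
To make the "invalid action formats" failure mode concrete, here is a minimal sketch of checking a trajectory that a language model emitted as text. The waypoint format `(x, y), (x, y), ...`, the function name `parse_trajectory`, and the fixed waypoint count are all assumptions made for illustration; the paper's actual text format is not given on this page.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical text format for a planned trajectory: "(x, y), (x, y), ..."
WAYPOINT = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_trajectory(text: str, n_points: int) -> Optional[List[Tuple[float, float]]]:
    """Return the waypoints if `text` holds exactly `n_points` pairs, else None."""
    points = [(float(x), float(y)) for x, y in WAYPOINT.findall(text)]
    # Anything left after removing well-formed pairs and separating commas
    # means the output violated the expected format.
    leftover = WAYPOINT.sub("", text).replace(",", "").strip()
    if leftover or len(points) != n_points:
        return None
    return points


ok = parse_trajectory("(0.0, 0.0), (1.5, 0.2), (3.1, 0.5)", n_points=3)
truncated = parse_trajectory("(0.0, 0.0), (1.5, 0.2", n_points=3)
chatty = parse_trajectory("I will turn left.", n_points=3)
```

A planner that emits continuous vectors directly never needs this kind of check, which is the motivation the summary above attributes to the paper.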

Core claim

ReCogDrive unifies understanding and planning by pairing an autoregressive vision-language model with a diffusion planner. Human driving cognition is transferred via a hierarchical pipeline of generation, refinement, and quality control. The model's priors are injected into the diffusion planner to produce stable continuous trajectories, and DiffGRPO reinforcement is applied to improve safety and comfort, resulting in state-of-the-art benchmark performance.

What carries the argument

The hierarchical data pipeline that embeds human driving cognition into the VLM, followed by injection of those priors into a diffusion planner reinforced by DiffGRPO for trajectory generation.
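
As a rough illustration of the "group relative" idea behind the DiffGRPO stage named above, the sketch below normalizes rewards within a group of trajectories sampled for the same scene. It shows only the generic group-relative advantage; it is not the paper's DiffGRPO objective, and the reward values, the function name, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale each reward by the statistics of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for four trajectories sampled from the same scene
# (made-up numbers, e.g. 0.0 could stand for a collision).
rewards = [0.9, 0.4, 0.7, 0.0]
advantages = group_relative_advantages(rewards)
```

Trajectories that score above the group mean get a positive advantage and are reinforced, while those below the mean are discouraged, so no separately learned value baseline is required.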

Load-bearing premise

The three-stage data pipeline transfers genuine human driving cognition into the model without embedding dataset-specific biases that block performance in real driving conditions.

What would settle it

Real-world closed-loop driving tests that measure collision rate and trajectory feasibility on routes not seen in the training data, compared against prior VLM-only planners.

Original abstract

Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ReCogDrive, a framework for end-to-end autonomous driving that pairs an autoregressive VLM with a diffusion planner to resolve the language-action mismatch. Human driving cognition is instilled in the VLM via a three-stage hierarchical data pipeline (generation, refinement, and quality control), and the planner is further optimized with Diffusion Group Relative Policy Optimization (DiffGRPO) for safety and comfort. The paper claims SOTA performance on the NAVSIM and Bench2Drive benchmarks, plus strong qualitative scene comprehension on DriveBench.

Significance. If the SOTA claims and generalization hold after proper validation, the work would meaningfully advance VLM-based driving by demonstrating a practical unification of cognitive priors with continuous trajectory generation, potentially improving handling of long-tail scenarios while maintaining real-time feasibility; the public release of code, weights, and data would further strengthen its impact.

major comments (3)
  1. [Abstract] Abstract and Experiments section: the SOTA claim on NAVSIM and Bench2Drive is stated without any quantitative baseline numbers, statistical significance tests, error bars, or ablation tables, leaving the central performance result impossible to assess from the provided information.
  2. [Method] Method section (hierarchical data pipeline): no ablation studies, distribution-shift metrics, or out-of-distribution tests are reported to verify that the generation/refinement/quality-control stages instill transferable cognition rather than benchmark-specific biases; if the pipeline sources overlap with NAVSIM/Bench2Drive simulators, gains may reflect distribution matching instead of genuine cognitive transfer.
  3. [Method] Method section (DiffGRPO): the reinforcement stage is introduced as a novel component with free hyperparameters, yet no comparison to standard policy-gradient or diffusion-specific RL baselines is supplied, nor is any sensitivity analysis given for those hyperparameters.
minor comments (2)
  1. [Abstract] Abstract: the acronym DiffGRPO is used before any expansion or definition, which reduces immediate readability.
  2. [Experiments] Qualitative results: the DriveBench examples would benefit from explicit failure-case analysis to substantiate the 'strong scene comprehension' claim.
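
Major comment 1 asks for error bars and significance tests. One common way to supply them is a paired bootstrap over per-scenario scores, sketched below. The function name, the resample count, and the toy scores are assumptions for illustration only; none of the numbers come from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference (a - b) with a percentile bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, low, high


# Made-up per-scenario scores for two planners evaluated on the same scenarios.
planner_a = [0.9, 0.5, 0.7, 0.6, 0.8]
planner_b = [0.6, 0.6, 0.5, 0.7, 0.4]
observed, low, high = paired_bootstrap_ci(planner_a, planner_b)
```

If the resulting interval excludes zero, the difference between the two planners is unlikely to be an artifact of which scenarios happened to be sampled.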

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below and will revise the manuscript to incorporate additional quantitative details, ablations, and analyses where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: the SOTA claim on NAVSIM and Bench2Drive is stated without any quantitative baseline numbers, statistical significance tests, error bars, or ablation tables, leaving the central performance result impossible to assess from the provided information.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full Experiments section contains baseline comparisons, but we will revise the abstract to report specific metrics (e.g., NAVSIM and Bench2Drive scores with improvements over baselines). We will also add error bars and statistical significance tests, and ensure that ablation tables are prominently featured in the revised Experiments section.

    revision: yes

  2. Referee: [Method] Method section (hierarchical data pipeline): no ablation studies, distribution-shift metrics, or out-of-distribution tests are reported to verify that the generation/refinement/quality-control stages instill transferable cognition rather than benchmark-specific biases; if the pipeline sources overlap with NAVSIM/Bench2Drive simulators, gains may reflect distribution matching instead of genuine cognitive transfer.

    Authors: We acknowledge the need for explicit verification of cognitive transfer. In revision we will add ablation studies that isolate each pipeline stage and quantify performance drops. We will clarify that the data sources include diverse real-world logs and synthetic scenarios beyond the benchmark simulators and will report distribution-shift metrics. Full OOD evaluation on entirely new simulators is noted as a limitation for future work.

    revision: partial

  3. Referee: [Method] Method section (DiffGRPO): the reinforcement stage is introduced as a novel component with free hyperparameters, yet no comparison to standard policy-gradient or diffusion-specific RL baselines is supplied, nor is any sensitivity analysis given for those hyperparameters.

    Authors: We agree that direct comparisons would better substantiate the contribution of DiffGRPO. In the revised manuscript we will include results against standard policy-gradient methods and other diffusion RL baselines, together with a sensitivity analysis on the key hyperparameters, all added to the Experiments section.

    revision: yes

Circularity Check

0 steps flagged

No circularity detected in ReCogDrive derivation or claims

Full rationale

The paper's SOTA claims on NAVSIM and Bench2Drive rest on external benchmark evaluations rather than any quantities defined in terms of the method's own fitted parameters or self-referential derivations. The hierarchical data pipeline (generation, refinement, quality control) is presented as an input-generation process to instill cognition into the VLM, followed by integration with a diffusion planner and DiffGRPO reinforcement; none of these steps reduce by construction to the target performance metrics or to self-citations that bear the central load. No equations, uniqueness theorems, or ansatzes are invoked that collapse the claimed cognitive transfer or trajectory generation back to the inputs by definition. The derivation chain remains self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard machine-learning assumptions about diffusion models and VLMs plus newly introduced components whose effectiveness is validated only through the reported benchmarks.

free parameters (1)
  • DiffGRPO optimization hyperparameters
    Parameters controlling the reinforcement learning stage that are tuned to improve safety and comfort metrics.
axioms (1)
  • domain assumption: Conditioning a diffusion planner on VLM-derived driving priors produces feasible and stable trajectories.
    Invoked when the paper states that VLM priors are injected into the diffusion planner.
invented entities (1)
  • DiffGRPO (no independent evidence)
    purpose: Reinforce the diffusion planner to reduce collisions and improve comfort.
    Newly proposed reinforcement learning variant introduced in the paper.
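
The domain assumption in the ledger above concerns a denoiser conditioned on a prior vector. The sketch below shows how such a condition enters a generic deterministic DDIM sampling loop over a flat trajectory vector. It is not the paper's planner: `oracle_eps` is a hypothetical stand-in for a learned noise-prediction network, `target_from_prior` is a made-up mapping from a one-number "prior" to a straight-line trajectory, and the noise schedule is arbitrary.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def ddim_sample(
    eps_model: Callable[[Vector, int, Vector], Vector],
    condition: Vector,
    dim: int,
    alpha_bars: List[float],
    seed: int = 0,
) -> Vector:
    """Deterministic DDIM sampling (eta = 0); alpha_bars[t] decreases with t."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = eps_model(x, t, condition)
        # Estimate the clean trajectory, then re-noise it to the previous level.
        x0_hat = [(xi - math.sqrt(1 - ab_t) * ei) / math.sqrt(ab_t)
                  for xi, ei in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1 - ab_prev) * ei
             for x0i, ei in zip(x0_hat, eps)]
    return x


alpha_bars = [0.99, 0.9, 0.7, 0.4, 0.1]  # arbitrary schedule for this sketch


def target_from_prior(condition: Vector) -> Vector:
    """Hypothetical: the prior is a speed; the target is positions at 0.5 s steps."""
    speed = condition[0]
    return [speed * 0.5 * k for k in range(1, 5)]


def oracle_eps(x: Vector, t: int, condition: Vector) -> Vector:
    """Stand-in for a trained network: returns the exact noise implied by the target."""
    target = target_from_prior(condition)
    ab = alpha_bars[t]
    return [(xi - math.sqrt(ab) * ti) / math.sqrt(1 - ab) for xi, ti in zip(x, target)]


trajectory = ddim_sample(oracle_eps, condition=[4.0], dim=4, alpha_bars=alpha_bars)
```

With a perfect noise predictor the loop recovers the conditioned target from any starting noise. Whether a learned, VLM-conditioned predictor is accurate enough to yield feasible trajectories is exactly what the ledger records as an assumption.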

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    We propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    Relation between the paper passage and the cited Recognition theorem.

    we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort.

  • IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 7.0

    VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.

  3. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  4. SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

    cs.RO 2026-04 unverdicted novelty 7.0

    SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.

  5. The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

  6. Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

    cs.RO 2026-03 unverdicted novelty 7.0

    PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

  7. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  8. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  9. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  10. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

  11. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  12. GSDrive: Reinforcing Driving Policies by Multi-mode Trajectory Probing with 3D Gaussian Splatting Environment

    cs.RO 2026-04 unverdicted novelty 6.0

    GSDrive improves end-to-end driving policies through 3D Gaussian Splatting simulation and multi-mode trajectory probing that supplies dense, differentiable rewards for reinforcement learning.

  13. Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

    cs.CV 2026-04 unverdicted novelty 6.0

    Creates LTD dataset for open-ended traffic VQA and trains UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.

  14. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  15. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  16. OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

  17. SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    Multi-ORFT improves closed-loop multi-agent driving planners by coupling scene-consistent diffusion pre-training with stable online RL post-training, reducing collisions and off-road rates while increasing speed on th...

  18. Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.

  19. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  20. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  21. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  22. Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

    cs.RO 2026-05 unverdicted novelty 5.0

    CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.

  23. CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...

  24. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  25. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  26. DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

    cs.CV 2026-03 unverdicted novelty 5.0

    DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 21 Pith papers · 17 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  3. [3]

    Is a 3d-tokenized llm the key to reliable autonomous driving? arXiv preprint arXiv:2405.18361,

    Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, et al. Is a 3d-tokenized llm the key to reliable autonomous driving? arXiv preprint arXiv:2405.18361,

  4. [4]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. A vision-languageaction flow model for general robot control.arXiv preprint arXiv:2410.24164, 2(3):5,

  6. [6]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810,

  7. [7]

    Multipath: Multiple proba- bilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449,

    Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple proba- bilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449,

  8. [8]

    Automated evaluation of large vision-language models on self-driving corner cases.arXiv preprint arXiv:2404.10595, 2024a

    Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases.arXiv preprint arXiv:2404.10595, 2024a. Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang ...

  9. [9]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161,

  10. [10]

    Talk2car: Taking control of your self-driving car.arXiv preprint arXiv:1909.10838,

    Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie-Francine Moens. Talk2car: Taking control of your self-driving car.arXiv preprint arXiv:1909.10838,

  11. [11]

    Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving

    Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yanjun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving.arXiv preprint arXiv:2504.19580,

  12. [12]

    Orion: A holistic end-to- end autonomous driving framework by vision-language in- structed action generation.arXiv preprint arXiv:2503.19755,

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755,

  13. [13]

    Rad: Training an end-to-end driving pol- icy via large-scale 3dgs-based reinforcement learning.arXiv preprint arXiv:2502.13144, 2025

    Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning.arXiv preprint arXiv:2502.13144,

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

  15. [15]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

  17. [17]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262,

  18. [18]

    Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656,

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656,

  19. [19]

    Senna: Bridging large vision-language mod- els and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024a. Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous dr...

  20. [20]

    Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025a

    12 Preprint Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, et al. Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025a. Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving wi...

  21. [21]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving.arXiv preprint arXiv:2411.15139,

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving.arXiv preprint arXiv:2411.15139,

  22. [22]

    Reinforced refinement with self-aware expansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800,

    Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800,

  23. [23]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https: //llava-vl.github.io/blog/2024-01-30-llava-next/. Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundat...

  24. [24]

    Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a

    Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a. Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023b. Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, ...

  25. [25]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pp. 194–210. Springer,

  26. [26]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952,

  27. [27]

    Diffusion policy policy optimization

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024.

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15120–15130, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, M...

  29. [29]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289,

  30. [30]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024a. ...

  31. [31]

    OpenEMMA: Open-source multimodal model for end-to-end autonomous driving

    Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. arXiv preprint arXiv:2412.15208,

  32. [32]

    Drama: An efficient end-to-end motion planner for autonomous driving with mamba

    Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024.

  33. [33]

    Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes

    Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430,

  34. [34]

    SparseAD: Sparse query-centric paradigm for efficient end-to-end autonomous driving

    Diankun Zhang, Guoan Wang, Runwen Zhu, Jianbo Zhao, Xiwu Chen, Siyu Zhang, Jiahao Gong, Qibin Zhou, Wenyuan Zhang, Ningzi Wang, et al. Sparsead: Sparse query-centric paradigm for efficient end-to-end autonomous driving. arXiv preprint arXiv:2404.06892, 2024a. Dongkun Zhang, Jiaming Liang, Ke Guo, Sha Lu, Qi Wang, Rong Xiong, Zhenwei Miao, and Yue Wang. Car...

  35. [35]

    Instruct large language models to drive like humans

    Ruijun Zhang, Xianda Guo, Wenzhao Zheng, Chenming Zhang, Kurt Keutzer, and Long Chen. Instruct large language models to drive like humans. arXiv preprint arXiv:2406.07296, 2024b. Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, and Chen Lv. Wisead: Knowledge augmented end-to-end autonomous driving with vision-language model. arXiv preprint arXiv:2412.09951...

  36. [36]

    Diffusion-based planning for autonomous driving with flexible guidance

    Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. arXiv preprint arXiv:2501.15564,

  37. [37]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404,

  38. [38]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479,

  39. [39]

    A Appendix

    We organize the supplementary material as follows. First, in Sec. B, we address potential questions that may arise from reading the main text. We then report ReCogDrive’s performance on the NAVSIM and DriveBench benchmarks, along with more detailed ablation studies in Sec. C. In Sec. D, we provide details of the training data colle...

  40. [40]

    Therefore, we primarily use the Planning task to evaluate a model’s practical capabilities in dynamic driving scenarios

    Q4. Does DriveVQA accurately reflect the capabilities of a VLM for autonomous driving? DriveBench has shown that many VLMs’ decision accuracy does not degrade with visual quality, indicating a reliance on priors rather than genuine visual understanding. Therefore, we primarily use the Planning task to evaluate a model’s practical capabilities in dynamic dr...

  41. [41]

    … | 97.5 | 96.3 | 80.1 | 93.0 | 100 | 99.9 | 98.3 | 65.5 | 97.4 | 79.8
    Hydra-MDP++ (Li et al., 2024) | 97.9 | 96.5 | 79.2 | 93.4 | 100 | 100.0 | 98.9 | 67.2 | 97.7 | 80.6
    ARTEMIS (Feng et al., 2025) | 98.3 | 95.1 | 81.5 | 97.4 | 100 | 99.8 | 98.6 | 96.5 | 98.3 | 83.1
    ReCogDrive | 98.3 | 95.2 | 87.1 | 97.5 | 98.3 | 99.8 | 99.5 | 96.6 | 86.5 | 83.6

    Experiments on NAVSIM with extended metrics. Hydra MDP++ (Li et al.,

  42. [42]

    We evaluate ReCogDrive on NAVSIM using these extended metrics as well

    introduces additional evaluation metrics: Traffic Light Compliance (TL), Lane Keeping Ability (LK), Driving Direction Compliance (DDC) and Extended Comfort (EC) to more comprehensively assess driving performance. We evaluate ReCogDrive on NAVSIM using these extended metrics as well. Tab. 6 ... (Footnote 1: https://github.com/autonomousvision/navsim/issues/116)

  43. [43]

    As shown in Tab. 10, outputting only trajectory achieves performance comparable to adding high-level command guidance, with almost identical PDMS scores

    Effect of VLM guidance modes. As shown in Tab. 10, outputting only trajectory achieves performance comparable to adding high-level command guidance, with almost identical PDMS scores. Interestingly, incorporating chain-of-thought reasoning does not further improve the results; instead, it slightly decreases the PDMS score by 0.1. This suggests that the c...

  44. [44]

    CODA-LM (Chen et al., 2024a) comprises 9,768 real-world driving scenarios with 41,722 textual annotations for critical road entities and 21,537 annotations for road corner cases

    is a dataset constructed based on nuScenes (Caesar et al., 2020), containing 91K multi-view video instruction-response pairs in 17 subtasks. CODA-LM (Chen et al., 2024a) comprises 9,768 real-world driving scenarios with 41,722 textual annotations for critical road entities and 21,537 annotations for road corner cases. OmniDrive (Wang et al., 2024b) covers ...

  45. [45]

    with crafted prompts to generate annotations across the full spectrum of autonomous driving tasks on NAVSIM (Dauner et al., 2025). These tasks span perception (e.g., scene description, key object identification, road marking recognition, traffic light classification, vulnerable road user detection), prediction (e.g., motion prediction), planning (e.g...