Recognition: 3 Lean theorem links
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Pith reviewed 2026-05-15 07:32 UTC · model grok-4.3
The pith
ReCogDrive combines a vision-language model for cognition with a reinforced diffusion planner to generate feasible, safe driving trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReCogDrive unifies understanding and planning by pairing an autoregressive vision-language model with a diffusion planner. Human driving cognition is transferred via a hierarchical pipeline of generation, refinement, and quality control. The model's priors are injected into the diffusion planner to produce stable continuous trajectories, and DiffGRPO reinforcement is applied to improve safety and comfort, resulting in state-of-the-art benchmark performance.
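As a rough illustration of that interface, here is a minimal sketch in NumPy; the module names, shapes, and denoising rule are invented stand-ins consistent with the paper's stated design, not its actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    def vlm_prior(scene_tokens: np.ndarray) -> np.ndarray:
        """Stand-in for the VLM: pool scene tokens into a driving-prior vector."""
        return scene_tokens.mean(axis=0)

    def denoise_step(traj: np.ndarray, prior: np.ndarray, t: float) -> np.ndarray:
        """Toy conditional denoiser: nudge the noisy trajectory toward a
        prior-dependent anchor, more strongly as t approaches 0."""
        anchor = np.tanh(prior[: traj.shape[1]])  # prior shapes the target
        return traj + (1.0 - t) * 0.1 * (anchor - traj)

    def plan(scene_tokens: np.ndarray, horizon: int = 8, steps: int = 20) -> np.ndarray:
        """Diffusion-style planning: start from noise and iteratively denoise,
        conditioned on the VLM prior rather than decoding actions as text."""
        prior = vlm_prior(scene_tokens)
        traj = rng.normal(size=(horizon, 2))  # (x, y) waypoints
        for k in range(steps, 0, -1):
            traj = denoise_step(traj, prior, t=k / steps)
        return traj

    waypoints = plan(rng.normal(size=(16, 32)))
    print(waypoints.shape)  # (8, 2): continuous waypoints, no language decoding

The point of the sketch is the claimed division of labor: the VLM supplies a conditioning signal, and all trajectory generation happens in continuous space.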
What carries the argument
The hierarchical data pipeline that embeds human driving cognition into the VLM, followed by injection of those priors into a diffusion planner reinforced by DiffGRPO for trajectory generation.
Load-bearing premise
The three-stage data pipeline transfers genuine human driving cognition into the model without embedding dataset-specific biases that would degrade performance in real driving conditions.
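A schematic of that three-stage control flow, with hypothetical stage functions and an illustrative quality threshold (the paper's actual generation models and filtering criteria are not reproduced here):

    from dataclasses import dataclass

    @dataclass
    class Sample:
        scene_id: str
        annotation: str
        quality: float

    def generate(scene_id: str) -> Sample:
        # Stage 1: draft a driving annotation for the scene (e.g. with a VLM).
        return Sample(scene_id, f"draft rationale for {scene_id}", quality=0.5)

    def refine(s: Sample) -> Sample:
        # Stage 2: rewrite the draft to better match human driving reasoning.
        return Sample(s.scene_id, s.annotation.replace("draft", "refined"),
                      quality=s.quality + 0.4)

    def passes_quality_control(s: Sample, threshold: float = 0.8) -> bool:
        # Stage 3: keep only samples whose quality score clears the threshold.
        return s.quality >= threshold

    def build_dataset(scene_ids):
        return [s for s in (refine(generate(sid)) for sid in scene_ids)
                if passes_quality_control(s)]

    print(len(build_dataset(["s001", "s002"])))  # 2: both clear the toy threshold

The premise above is about what this filter keeps: if quality control rewards benchmark-typical behavior, the surviving samples encode benchmark bias rather than cognition.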
What would settle it
Real-world closed-loop driving tests that measure collision rate and trajectory feasibility on routes not seen in the training data, compared against prior VLM-only planners.
Original abstract
Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReCogDrive, a framework for end-to-end autonomous driving that integrates an autoregressive VLM (instilled with human driving cognition via a three-stage hierarchical data pipeline of generation, refinement, and quality control) with a diffusion planner to resolve the language-action mismatch, further optimized via Diffusion Group Relative Policy Optimization (DiffGRPO) for safety and comfort. It claims SOTA performance on the NAVSIM and Bench2Drive benchmarks plus strong qualitative scene comprehension on DriveBench.
Significance. If the SOTA claims and generalization hold after proper validation, the work would meaningfully advance VLM-based driving by demonstrating a practical unification of cognitive priors with continuous trajectory generation, potentially improving handling of long-tail scenarios while maintaining real-time feasibility; the public release of code, weights, and data would further strengthen its impact.
major comments (3)
- [Abstract] Abstract and Experiments section: the SOTA claim on NAVSIM and Bench2Drive is stated without any quantitative baseline numbers, statistical significance tests, error bars, or ablation tables, leaving the central performance result impossible to assess from the provided information.
- [Method] Method section (hierarchical data pipeline): no ablation studies, distribution-shift metrics, or out-of-distribution tests are reported to verify that the generation/refinement/quality-control stages instill transferable cognition rather than benchmark-specific biases; if the pipeline sources overlap with NAVSIM/Bench2Drive simulators, gains may reflect distribution matching instead of genuine cognitive transfer.
- [Method] Method section (DiffGRPO): the reinforcement stage is introduced as a novel component with free hyperparameters, yet no comparison to standard policy-gradient or diffusion-specific RL baselines is supplied, nor is any sensitivity analysis given for those hyperparameters.
minor comments (2)
- [Abstract] Abstract: the acronym DiffGRPO is used before any expansion or definition, which reduces immediate readability.
- [Experiments] Qualitative results: the DriveBench examples would benefit from explicit failure-case analysis to substantiate the 'strong scene comprehension' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below and will revise the manuscript to incorporate additional quantitative details, ablations, and analyses where appropriate.
Point-by-point responses
- Referee: [Abstract] Abstract and Experiments section: the SOTA claim on NAVSIM and Bench2Drive is stated without any quantitative baseline numbers, statistical significance tests, error bars, or ablation tables, leaving the central performance result impossible to assess from the provided information.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. The full Experiments section contains baseline comparisons, but we will revise the abstract to report specific metrics (e.g., NAVSIM and Bench2Drive scores with improvements over baselines). We will also add error bars and statistical significance tests and ensure that ablation tables are prominently featured in the revised Experiments section. revision: yes
- Referee: [Method] Method section (hierarchical data pipeline): no ablation studies, distribution-shift metrics, or out-of-distribution tests are reported to verify that the generation/refinement/quality-control stages instill transferable cognition rather than benchmark-specific biases; if the pipeline sources overlap with the NAVSIM/Bench2Drive simulators, gains may reflect distribution matching instead of genuine cognitive transfer.
  Authors: We acknowledge the need for explicit verification of cognitive transfer. In revision we will add ablation studies that isolate each pipeline stage and quantify the resulting performance drops. We will clarify that the data sources include diverse real-world logs and synthetic scenarios beyond the benchmark simulators, and we will report distribution-shift metrics. Full OOD evaluation on entirely new simulators is noted as a limitation for future work. revision: partial
- Referee: [Method] Method section (DiffGRPO): the reinforcement stage is introduced as a novel component with free hyperparameters, yet no comparison to standard policy-gradient or diffusion-specific RL baselines is supplied, nor is any sensitivity analysis given for those hyperparameters.
  Authors: We agree that direct comparisons would better substantiate the contribution of DiffGRPO. In the revised manuscript we will include results against standard policy-gradient methods and other diffusion RL baselines, together with a sensitivity analysis of the key hyperparameters, all added to the Experiments section. revision: yes
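For context on the requested baselines, this is the group-relative advantage computation that GRPO-style methods share, in a generic sketch; the reward is a made-up comfort proxy, and DiffGRPO's actual reward terms and diffusion-specific update are not reproduced:

    import numpy as np

    rng = np.random.default_rng(0)

    def reward(traj: np.ndarray) -> float:
        """Placeholder reward: penalize jerky motion (a stand-in for the
        paper's safety and comfort terms)."""
        return -float(np.abs(np.diff(traj, n=2, axis=0)).sum())

    # Sample a group of G candidate plans for the same scene.
    G = 8
    group = [rng.normal(size=(8, 2)).cumsum(axis=0) for _ in range(G)]
    rewards = np.array([reward(t) for t in group])

    # Group-relative advantages: normalize within the group, no learned critic.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    print(adv.round(2))  # positive entries mark above-average plans to reinforce

A sensitivity analysis of the kind the referee asks for would then vary the group size, the reward weights, and the normalization constant.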
Circularity Check
No circularity detected in ReCogDrive's derivation or claims.
Full rationale
The paper's SOTA claims on NAVSIM and Bench2Drive rest on external benchmark evaluations rather than any quantities defined in terms of the method's own fitted parameters or self-referential derivations. The hierarchical data pipeline (generation, refinement, quality control) is presented as an input-generation process to instill cognition into the VLM, followed by integration with a diffusion planner and DiffGRPO reinforcement; none of these steps reduce by construction to the target performance metrics or to self-citations that bear the central load. No equations, uniqueness theorems, or ansatzes are invoked that collapse the claimed cognitive transfer or trajectory generation back to the inputs by definition. The derivation chain remains self-contained against the stated external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- DiffGRPO optimization hyperparameters
axioms (1)
- domain assumption: conditioning a diffusion planner on VLM-derived driving priors produces feasible and stable trajectories (a toy feasibility check follows this ledger).
invented entities (1)
- DiffGRPO (no independent evidence)
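That assumption is checkable in principle. A toy kinematic feasibility test over planned waypoints might look like this (the limits are illustrative, not NAVSIM or Bench2Drive thresholds):

    import numpy as np

    def is_feasible(traj: np.ndarray, dt: float = 0.5,
                    v_max: float = 20.0, a_max: float = 4.0) -> bool:
        """Sanity check on (T, 2) waypoints: bounded speed and acceleration.
        Illustrative limits only, not the benchmarks' actual thresholds."""
        vel = np.diff(traj, axis=0) / dt
        acc = np.diff(vel, axis=0) / dt
        return bool(np.linalg.norm(vel, axis=1).max() <= v_max
                    and np.linalg.norm(acc, axis=1).max() <= a_max)

    straight = np.stack([np.linspace(0, 20, 9), np.zeros(9)], axis=1)
    print(is_feasible(straight))  # True: gentle, constant-speed motion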
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous).
  Passage: "We propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi
  Tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous).
  Passage: "we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort."
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous).
  Passage: "Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
  VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching an 88.91 driving score on Bench2Drive.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.
- The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
  Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- DriveFuture: Future-Aware Latent World Models for Autonomous Driving
  DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- GSDrive: Reinforcing Driving Policies by Multi-mode Trajectory Probing with 3D Gaussian Splatting Environment
  GSDrive improves end-to-end driving policies through 3D Gaussian Splatting simulation and multi-mode trajectory probing that supplies dense, differentiable rewards for reinforcement learning.
- Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
  Creates the LTD dataset for open-ended traffic VQA and trains the UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
  OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  Multi-ORFT improves closed-loop multi-agent driving planners by coupling scene-consistent diffusion pre-training with stable online RL post-training, reducing collisions and off-road rates while increasing speed on th...
- Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
  TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
  ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
  DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
  CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.
- CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
  CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
  RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
- DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
  DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.