pith · machine review for the scientific record

arxiv: 2604.09059 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI

Recognition: unknown

Learning Vision-Language-Action World Models for Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Vision-Language-Action models · World models · Autonomous driving · Trajectory planning · Future scene generation · Reinforcement learning · nuScenes dataset

The pith

VLA-World unifies predictive imagination and reflective reasoning to enhance autonomous driving foresight and safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLA-World, a vision-language-action world model that generates future scene images guided by predicted trajectories and then reasons over those imagined frames to refine the trajectories. This addresses two complementary gaps: standard VLA models lack explicit temporal dynamics, while world models cannot reason about the futures they generate. By curating a new dataset and applying a three-stage training process that culminates in reinforcement learning, the approach aims to improve both planning decisions and the quality of predicted future scenes. A sympathetic reader would care because better foresight in self-driving systems could mean safer navigation in dynamic environments. The model demonstrates superior performance on planning and future-generation benchmarks compared to existing methods.
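A minimal sketch of that imagine-then-reflect loop in Python; the three model calls (propose_trajectory, imagine_next_frame, refine_trajectory) are hypothetical stand-ins for the paper's components, not its actual API:

    def plan_with_reflection(model, camera_frames, horizon_s=3.0):
        """One imagine-then-reflect planning step as the abstract describes it.

        `model` is assumed to expose three heads of the unified VLA world
        model; the names and signatures here are illustrative only.
        """
        # Step 1: a standard VLA forward pass proposes a feasible trajectory
        # (a sequence of future ego waypoints) from the current observation.
        initial_traj = model.propose_trajectory(camera_frames, horizon_s)

        # Step 2: that trajectory conditions next-frame generation, so the
        # imagined image reflects how the scene evolves under the action.
        imagined_frame = model.imagine_next_frame(camera_frames, initial_traj)

        # Step 3: the model reasons over its own imagined future and revises
        # the trajectory; this is the reflective step the referee asks about.
        refined_traj = model.refine_trajectory(camera_frames, imagined_frame,
                                               initial_traj)
        return refined_traj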

Core claim

VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues. It then reasons over this self-generated future imagined frame to refine the predicted trajectory. Supported by the nuScenes-GR-20K dataset and three-stage training, this unification of imagination and reflection leads to higher performance and better interpretability in autonomous driving tasks.
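For concreteness, a hedged sketch of how such a three-stage schedule might be organized. The stage names come from the abstract; the data fields, objectives, and the fit_stage method are assumptions about a typical pipeline, not the paper's training interface:

    # Schematic three-stage schedule (pretraining -> SFT -> RL), following
    # the abstract. Objectives and data fields are illustrative assumptions.
    STAGES = [
        dict(name="pretraining",
             data="nuScenes-GR-20K generation pairs",
             objective="trajectory-conditioned next-frame prediction"),
        dict(name="supervised_fine_tuning",
             data="nuScenes-GR-20K reasoning annotations",
             objective="trajectory refinement given the imagined frame"),
        dict(name="reinforcement_learning",
             data="model rollouts scored by a planning reward",
             objective="policy-gradient fine-tuning of the full loop"),
    ]

    def train(model, stages=STAGES):
        for stage in stages:
            # Each stage swaps dataset and loss; fit_stage is hypothetical.
            model.fit_stage(stage["name"], stage["data"], stage["objective"])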

What carries the argument

The action-guided future-frame generation followed by reflective reasoning over the imagined scene, which refines the trajectory prediction.

If this is right

  • Improved trajectory prediction accuracy by incorporating future scene reasoning.
  • Better interpretability of the model's driving decisions through explicit future simulation.
  • Enhanced performance on both planning and scene generation benchmarks over prior VLA and world model approaches.
  • More robust handling of temporal dynamics and global world consistency in driving scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method suggests that closing the loop between action prediction and visual imagination could generalize to other robotics domains like manipulation or navigation.
  • If generation errors are mitigated, the reflective step might enable safer long-horizon planning without manual rule-based safety checks.
  • Future work could test whether this self-refinement reduces the need for extensive human-labeled trajectory data.

Load-bearing premise

That reasoning over the model's own generated future frames will consistently correct trajectory errors rather than amplify inaccuracies from imperfect image synthesis.

What would settle it

A controlled ablation experiment showing that disabling the reasoning-over-imagined-frame step results in no performance drop or even improvement on the planning benchmarks would falsify the benefit of the reflective component.
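A minimal sketch of that settling experiment, assuming nuScenes-style open-loop planning metrics (average L2 error, collision rate) and a planner that can be run with the reflective step disabled; the evaluation loop and flag names are hypothetical:

    def ablate_reflection(planner, scenes):
        """planner(scene, use_reflection=...) is a hypothetical interface
        returning (l2_error_m, collided) for one scene."""
        results = {}
        for use_reflection in (True, False):
            l2_errors, collisions = [], 0
            for scene in scenes:
                l2_m, collided = planner(scene, use_reflection=use_reflection)
                l2_errors.append(l2_m)
                collisions += int(collided)
            results[use_reflection] = {
                "avg_l2_m": sum(l2_errors) / len(l2_errors),
                "collision_rate": collisions / len(scenes),
            }
        # If results[False] matches or beats results[True], the reflective
        # component carries no net benefit and the headline claim fails.
        return results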

Figures

Figures reproduced from arXiv: 2604.09059 by Bailan Feng, Chao Ma, Guodongfang Zhao, Guoqing Wang, Pin Tang, Xiangxuan Ren.

Figure 1. Visual overview of VLA-World. The model learns through three progressive stages. We first activate visual generation by …
Figure 2. Comparison of the (a) VLA, (b) World Model, and (c) …
Figure 3. The illustration of the three-stage training and inference pipeline of VLA-World. Our training pipeline consists of three key …
Figure 4. Visualization of our VLA-World compared with the …
Figure 5. Data sample of (a) pretraining stage, (b) supervised fine …
Figure 6. Comparison between our VLA-World and the state-of-the-art FSDrive …
Figure 7. Comparison of 3-second future trajectory predictions generated by our VLA-World and the state-of-the-art FSDrive …
Original abstract

Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VLA-World, a vision-language-action world model for autonomous driving that unifies predictive imagination with reflective reasoning. The pipeline first conditions next-frame generation on an action-derived initial trajectory to capture spatial-temporal evolution, then reasons over the self-generated imagined frame to refine the trajectory. It curates the nuScenes-GR-20K generative reasoning dataset from nuScenes and employs a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning). The central claim is that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks.

Significance. If the results hold under rigorous validation, this work could meaningfully advance end-to-end autonomous driving by bridging world models' predictive capabilities with VLA-style reasoning, potentially improving foresight and interpretability. The dataset curation and staged training approach are constructive contributions that may enable follow-on research. The significance is limited by the absence of targeted validation for the core refinement loop.

major comments (3)
  1. [§4] §4 (Experiments): No ablation is reported that removes only the reflective reasoning module while keeping the initial trajectory-guided generation fixed. This is load-bearing for the headline claim, as the superiority over baselines is attributed to the reasoning step refining trajectories; without it, it is impossible to determine whether the loop provides net benefit or amplifies generation artifacts.
  2. [§3 and §4] §3 (Method) and §4 (Experiments): The manuscript provides no per-scene or per-metric correlation between future-frame generation quality (e.g., FID, PSNR, or LPIPS on nuScenes-GR-20K) and the delta in planning metrics (e.g., collision rate or trajectory error) before versus after reasoning. This leaves the assumption that reasoning reliably refines rather than compounds errors untested.
  3. [§4] §4 (Experiments): Failure-case analysis is absent; there is no examination of scenes where low-quality imagined frames lead to worse final trajectories than the initial action-derived prediction, which would directly address the risk of error propagation highlighted in the pipeline design.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., planning metric improvements) to substantiate the superiority claim rather than stating it qualitatively.
  2. [§3] A diagram or pseudocode for the three-stage training pipeline and the exact conditioning of generation on the initial trajectory would improve clarity of the method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validating the reflective reasoning component in our pipeline. We have addressed each point by adding the requested analyses to the revised manuscript. Our responses are provided below.

Point-by-point responses
  1. Referee: §4 (Experiments): No ablation is reported that removes only the reflective reasoning module while keeping the initial trajectory-guided generation fixed. This is load-bearing for the headline claim, as the superiority over baselines is attributed to the reasoning step refining trajectories; without it, it is impossible to determine whether the loop provides net benefit or amplifies generation artifacts.

    Authors: We agree that isolating the contribution of the reflective reasoning module is necessary to support the central claim. In the revised manuscript, we have added an ablation study in Section 4 that compares the full VLA-World model to a variant using only the action-guided generation step without reflective reasoning. The results show that including the reasoning module yields consistent gains in planning metrics such as lower collision rates and reduced trajectory error, confirming a net benefit. The updated experiments and table are now included. revision: yes

  2. Referee: §3 (Method) and §4 (Experiments): The manuscript provides no per-scene or per-metric correlation between future-frame generation quality (e.g., FID, PSNR, or LPIPS on nuScenes-GR-20K) and the delta in planning metrics (e.g., collision rate or trajectory error) before versus after reasoning. This leaves the assumption that reasoning reliably refines rather than compounds errors untested.

    Authors: We acknowledge the value of correlating generation quality with planning improvements to test the refinement assumption. We have added this analysis to the revised Section 4, reporting per-metric correlations across the test set (e.g., between FID and trajectory error delta). A positive correlation is observed, indicating that better generation quality is associated with larger planning gains after reasoning. Per-scene breakdowns are provided for representative examples due to high scene variability; aggregate statistics and a discussion of limitations are included (a minimal sketch of such an audit follows these responses). revision: yes

  3. Referee: §4 (Experiments): Failure-case analysis is absent; there is no examination of scenes where low-quality imagined frames lead to worse final trajectories than the initial action-derived prediction, which would directly address the risk of error propagation highlighted in the pipeline design.

    Authors: We recognize that failure-case analysis is important for addressing potential error propagation. In the revised manuscript, we have added a dedicated failure-case subsection in Section 4. This examines scenes where low-quality imagined frames (high FID/LPIPS) lead to final trajectories worse than the initial prediction. We provide quantitative frequency statistics and qualitative examples, noting that such cases are infrequent and typically arise in complex dynamic scenes. Mitigation approaches are discussed. revision: yes
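For responses 2 and 3, a minimal sketch of the requested audit: the correlation between per-scene generation quality and the planning delta from reasoning, plus a count of reflection-induced regressions. The record keys are assumptions about what such a log would contain, not the paper's reporting format:

    import numpy as np

    def refinement_audit(records):
        """records: per-scene dicts with hypothetical keys 'lpips'
        (imagined-frame quality, lower is better), 'l2_initial' and
        'l2_refined' (trajectory error before/after reasoning)."""
        lpips = np.array([r["lpips"] for r in records])
        gain = np.array([r["l2_initial"] - r["l2_refined"] for r in records])

        # Pearson r between frame quality and planning gain (response 2);
        # a negative r means lower-LPIPS (better) frames coincide with
        # larger gains from the reflective step.
        r = float(np.corrcoef(lpips, gain)[0, 1])

        # Scenes where reflection made the trajectory worse (response 3).
        regressions = int((gain < 0).sum())
        return {"pearson_r": r,
                "regression_rate": regressions / len(records)}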

Circularity Check

0 steps flagged

No significant circularity; pipeline is data-driven and externally benchmarked

Full rationale

The VLA-World architecture is presented as a trained multimodal model that generates next-frame images conditioned on action-derived trajectories and then applies reasoning to refine those trajectories. This is supported by a curated external dataset (nuScenes-GR-20K) and a three-stage training procedure (pretraining, SFT, RL) evaluated on standard planning and generation benchmarks. No equations, uniqueness theorems, or central claims reduce by construction to fitted parameters or self-citations; the derivation chain relies on independent data and external baselines rather than self-referential definitions or imported ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not provide enough detail to identify specific free parameters, axioms, or invented entities. The model architecture, dataset curation, and training stages are described at a high level without mathematical formulations or assumptions listed.

pith-pipeline@v0.9.0 · 5535 in / 1215 out tokens · 83467 ms · 2026-05-10T17:02:57.417442+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).

  2. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

Reference graph

Works this paper leans on

87 extracted references · 45 canonical work pages · cited by 1 Pith paper · 23 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, pages 23716–23736, 2022.

  3. [3]

    CoVLA: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In WACV, pages 1933–1943, 2025.

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. …

  6. [6]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024.

  7. [7]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  8. [8]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

  9. [9]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

  10. [10]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.

  11. [11]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.

  12. [12]

    End-to-end autonomous driving: Challenges and frontiers

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE TPAMI, 2024.

  13. [13]

    VADv2: End-to-end vectorized autonomous driving via probabilistic planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. VADv2: End-to-end vectorized autonomous driving via probabilistic planning, 2024.

  14. [14]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

  15. [15]

    PPAD: Iterative interactions of prediction and planning for end-to-end autonomous driving

    Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. PPAD: Iterative interactions of prediction and planning for end-to-end autonomous driving. In ECCV, pages 239–256, 2024.

  16. [16]

    Impromptu VLA: Open weights and open data for driving vision-language-action models

    Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu VLA: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025.

  17. [17]

    PaLM-E: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. PaLM-E: An embodied multimodal language model. In ICML, 2023.

  18. [18]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021.

  19. [19]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  20. [20]

    DME-Driver: Integrating human decision logic and 3D scene perception in autonomous driving

    Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, and Jianbing Shen. DME-Driver: Integrating human decision logic and 3D scene perception in autonomous driving. In AAAI, pages 3347–3355, 2025.

  21. [21]

    GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In CVPR, pages 22404–22415, 2025.

  22. [22]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

  23. [23]

    ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, pages 533–549, 2022.

  24. [24]

    DrivingWorld: Constructing world model for autonomous driving via video GPT

    Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. DrivingWorld: Constructing world model for autonomous driving via video GPT. arXiv preprint arXiv:2412.19505, 2024.

  25. [25]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023.

  26. [26]

    Making large language models better planners with reasoning-decision alignment

    Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, et al. Making large language models better planners with reasoning-decision alignment. In ECCV, pages 73–90, 2024.

  27. [27]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.

  28. [28]

    ADriver-I: A General World Model for Autonomous Driving

    Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai Wang. ADriver-I: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023.

  29. [29]

    VAD: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In ICCV, pages 8306–8316, 2023.

  30. [30]

    AlphaDrive: Unleashing the power of VLMs in autonomous driving via reinforcement learning and reasoning

    Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. AlphaDrive: Unleashing the power of VLMs in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608, 2025.

  31. [31]

    DriveGAN: Towards a controllable high-quality neural simulation

    Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021.

  32. [32]

    UniScene: Unified occupancy-centric driving scene generation

    Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. UniScene: Unified occupancy-centric driving scene generation. arXiv preprint arXiv:2412.05435, 2024.

  33. [33]

    Driving everywhere with large language model policy adaptation

    Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. In CVPR, pages 14948–14957, 2024.

  34. [34]

    OmniNWM: Omniscient driving navigation world models

    Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. OmniNWM: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025.

  35. [35]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022.

  36. [36]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023.

  37. [37]

    Semi-supervised vision-centric 3D occupancy world model for autonomous driving

    Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, and Yilun Chen. Semi-supervised vision-centric 3D occupancy world model for autonomous driving. arXiv preprint arXiv:2502.07309, 2025.

  38. [38]

    Is ego status all you need for open-loop end-to-end autonomous driving?

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, pages 14864–14873, 2024.

  39. [39]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, pages 34892–34916, 2023.

  40. [40]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024.

  41. [41]

    WoVoGen: World volume-aware diffusion for controllable multi-camera driving scene generation

    Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. WoVoGen: World volume-aware diffusion for controllable multi-camera driving scene generation. In ECCV, pages 329–345, 2024.

  42. [42]

    Dolphins: Multimodal language model for driving

    Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In ECCV, pages 403–420, 2024.

  43. [43]

    A language agent for autonomous driving

    Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving. arXiv preprint arXiv:2311.10813, 2023.

  44. [44]

    DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving

    Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving. In CVPR, pages 15522–15533, 2024.

  45. [45]

    Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving. In ECCV, pages 292–308, 2024.

  46. [46]

    AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving

    Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving. arXiv preprint arXiv:2505.15298, 2025.

  47. [47]

    Grounding Everything in Tokens for Multimodal Large Language Models

    Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, and Chao Ma. Grounding everything in tokens for multimodal large language models. arXiv preprint arXiv:2512.10554, 2025.

  48. [48]

    Reasoning in computer vision: Taxonomy, models, tasks, and methodologies

    Ayushman Sarkar, Mohd Yamani Idna Idris, and Zhenyu Yu. Reasoning in computer vision: Taxonomy, models, tasks, and methodologies. arXiv preprint arXiv:2508.10523, 2025.

  49. [49]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  50. [50]

    LMDrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. LMDrive: Closed-loop end-to-end driving with large language models. In CVPR, pages 15120–15130, 2024.

  51. [51]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  52. [52]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

  53. [53]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.

  54. [54]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  55. [55]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, pages 6309–6318, 2017.

  56. [56]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  57. [57]

    OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In CVPR, pages 22442–22452, 2025.

  58. [58]

    DriveMLM: Aligning multi-modal large language models with behavioral planning states for autonomous driving

    Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. DriveMLM: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023.

  59. [59]

    DriveDreamer: Towards real-world-driven world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.

  60. [60]

    WorldDreamer: Towards general world models for video generation via predicting masked tokens

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Boyuan Wang, Xinze Chen, and Jiwen Lu. WorldDreamer: Towards general world models for video generation via predicting masked tokens. arXiv preprint arXiv:2401.09985, 2024.

  61. [61]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In CVPR, pages 14749–14759, 2024.

  62. [62]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Fei Xia, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, pages 24824–24837, 2022.

  63. [63]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In CVPR, pages 12966–12977, 2025.

  64. [64]

    MARS: An instance-aware, modular and realistic simulator for autonomous driving

    Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, et al. MARS: An instance-aware, modular and realistic simulator for autonomous driving. In CAAI, pages 3–15, 2023.

  65. [65]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.

  66. [66]

    Occ-LLM: Enhancing autonomous driving with occupancy-based large language models

    Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-LLM: Enhancing autonomous driving with occupancy-based large language models. arXiv preprint arXiv:2502.06419, 2025.

  67. [67]

    DriveGPT4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023.

  68. [68]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

  69. [69]

    Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In CVPR, pages 14662–14672, 2024.

  70. [70]

    ReSim: Reliable World Simulation for Autonomous Driving

    Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. ReSim: Reliable world simulation for autonomous driving. arXiv preprint arXiv:2506.09981, 2025.

  71. [71]

    Driving in the occupancy world: Vision-centric 4D occupancy forecasting and planning via world models for autonomous driving

    Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the occupancy world: Vision-centric 4D occupancy forecasting and planning via world models for autonomous driving. In AAAI, pages 9327–9335, 2025.

  72. [72]

    DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

    Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278, 2025.

  73. [73]

    AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

    Zhenlong Yuan, Jing Tang, Jinguo Luo, Rui Chen, Chengxuan Qian, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. AutoDrive-R$^2$: Incentivizing reasoning and self-reflection capacity for VLA model in autonomous driving. arXiv preprint arXiv:2509.01944, 2025.

  74. [74]

    FutureSightDrive: Thinking visually with spatio-temporal CoT for autonomous driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. FutureSightDrive: Thinking visually with spatio-temporal CoT for autonomous driving. arXiv preprint arXiv:2505.17685, 2025.

  75. [75]

    Feedback-guided autonomous driving

    Jimuyang Zhang, Zanming Huang, Arijit Ray, and Eshed Ohn-Bar. Feedback-guided autonomous driving. In CVPR, pages 15000–15011, 2024.

  76. [76]

    ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles

    Jiawei Zhang, Chejian Xu, and Bo Li. ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In CVPR, pages 15459–15469, 2024.

  77. [77]

    DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025.

  78. [78]

    OccWorld: Learning a 3D occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. OccWorld: Learning a 3D occupancy world model for autonomous driving. In ECCV, pages 55–72, 2024.

  79. [79]

    GenAD: Generative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. GenAD: Generative end-to-end autonomous driving. arXiv preprint arXiv:2402.11502, 2024.

  80. [80]

    Doe-1: Closed-loop autonomous driving with large world model

    Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model. arXiv preprint arXiv:2412.09627, 2024.

Showing first 80 references.