
arxiv: 2606.05645 · v2 · pith:BMYLTPFBnew · submitted 2026-06-04 · 💻 cs.RO

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Pith reviewed 2026-06-28 01:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords discrete tokens · world models · policy learning · autonomous driving · vision-action alignment · hierarchical planning · token editing · multi-task pretraining

The pith

A shared discrete token space for observations, states, decisions and actions lets world prediction directly generate driving policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that representing visual observations, future states, high-level decisions and ego actions inside one discrete token vocabulary allows joint multi-task training of world modeling and policy modeling. A reader would care because most current driving systems either copy actions from data or train world models that stay poorly connected to the final planner, producing brittle behavior in changing physical scenes. If the alignment works, action-conditioned future prediction becomes a direct source for policy output instead of requiring separate alignment steps. The method adds hierarchical decision tokens that sketch the plan and then refines dense action tokens in parallel through editing.

Core claim

Discrete-WAM places visual observations, future states, high-level decisions and ego actions inside a single discrete token space, then jointly trains world modeling, world-policy modeling and policy modeling through multi-task and multi-stage pretraining. This alignment makes action-conditioned future prediction serve policy generation without extra mechanisms. Downstream planning decomposes into hierarchical decision prediction followed by confidence-based parallel action-token editing that produces dense future actions efficiently.
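The confidence-based parallel decoding mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_predictor`, the `MASK` sentinel, the equal-share commit schedule, and all sizes are assumptions made for illustration, and the sketch only fills masked positions rather than revising committed ones.

```python
import math

MASK = -1  # hypothetical sentinel for a not-yet-decided action token


def toy_predictor(tokens, decision):
    """Stand-in for the model: one (proposal, confidence) pair per position.

    A real model would condition on observations and committed tokens;
    this toy version is deterministic so the example is reproducible.
    """
    return [((decision * 7 + i * 3) % 16, 1.0 / (1 + i)) for i in range(len(tokens))]


def edit_actions(decision, horizon=8, steps=3, predictor=toy_predictor):
    """Fill `horizon` action tokens in at most `steps` parallel passes."""
    tokens = [MASK] * horizon
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = predictor(tokens, decision)
        # Commit an equal share of the remaining positions per pass,
        # so the final pass always commits everything that is left.
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens


plan = edit_actions(decision=2)
```

Here the decision token conditions every pass, and each pass commits only the positions the predictor is most confident about. The number of model calls is bounded by `steps` rather than by the number of action tokens, which is the efficiency argument the paper makes for parallel decoding.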

What carries the argument

The shared discrete token space that unifies visual observations, future states, high-level decisions and ego actions so that action-conditioned prediction directly supplies policy tokens.
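One common way to realize a shared discrete token space is to give each modality a contiguous block of ids inside a single flat vocabulary. The sketch below assumes that layout; the class name `SharedVocab`, the modality names, and the codebook sizes are hypothetical, since the source does not report the actual vocabulary design.

```python
from dataclasses import dataclass, field


@dataclass
class SharedVocab:
    """One flat id space built from per-modality codebooks (hypothetical layout)."""
    sizes: dict
    offsets: dict = field(default_factory=dict)
    total: int = 0

    def __post_init__(self):
        # Assign each modality a contiguous id range, in insertion order.
        cursor = 0
        for name, size in self.sizes.items():
            self.offsets[name] = cursor
            cursor += size
        self.total = cursor

    def encode(self, modality, local_id):
        """Map a modality-local code index to a global token id."""
        if not 0 <= local_id < self.sizes[modality]:
            raise ValueError(f"{local_id} out of range for {modality}")
        return self.offsets[modality] + local_id

    def decode(self, token_id):
        """Map a global token id back to (modality, local index)."""
        for name, start in self.offsets.items():
            if start <= token_id < start + self.sizes[name]:
                return name, token_id - start
        raise ValueError(f"{token_id} outside shared vocabulary")


# Illustrative sizes only; the source does not report the real ones.
vocab = SharedVocab({"vision": 8192, "decision": 16, "action": 256})
sequence = [vocab.encode("vision", 41), vocab.encode("decision", 2), vocab.encode("action", 3)]
```

With such a layout, a single sequence model can emit vision, decision, and action tokens from one output head, which is the property the argument depends on.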

If this is right

  • Action-conditioned future prediction directly supports policy generation on large-scale driving benchmarks.
  • The same model enables controllable future generation and counterfactual evaluation.
  • Surprise-based analysis of the world model becomes possible from the same token predictions.
  • Policy decoding runs in parallel through hierarchical decision tokens and confidence-based action editing.
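As an illustration of the surprise-based analysis in the list above, one plausible definition scores surprise as the mean negative log-likelihood of the observed future tokens under the model's predicted distributions. The paper's exact definition is not given here, so the function name `surprise`, the input layout, and the clamp value are assumptions.

```python
import math


def surprise(predicted_probs, observed_tokens, eps=1e-12):
    """Mean negative log-likelihood (in nats) of the observed tokens.

    predicted_probs[i] is the model's distribution over token ids at
    position i; observed_tokens[i] is the id that actually occurred.
    """
    if len(predicted_probs) != len(observed_tokens) or not observed_tokens:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for probs, token in zip(predicted_probs, observed_tokens):
        total += -math.log(max(probs[token], eps))  # clamp to avoid log(0)
    return total / len(observed_tokens)


# Two positions: the observed token had probability 0.5, then 0.25.
score = surprise([[0.5, 0.5], [0.25, 0.75]], [0, 0])
```

A higher score means the observed future was less expected by the world model, so frames or scenarios can be ranked by this value.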

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-editing approach could transfer to other embodied tasks where future prediction must stay tightly coupled to control.
  • Removing the need for separate world-model and policy heads might simplify training loops in physical agents.
  • Token-level editing could make it easier to inspect or intervene on specific parts of a planned trajectory.

Load-bearing premise

Putting visual observations, future states, decisions and actions into one shared discrete token space will preserve enough fidelity for action-conditioned prediction to support policy generation without further alignment steps.

What would settle it

The claim would be weakened if, on the same large-scale autonomous-driving benchmarks, a variant that keeps separate continuous or non-shared token spaces for vision and action matched or exceeded the planning metrics of the unified discrete version at comparable compute.

Original abstract

Autonomous driving requires reasoning about how ego actions shape future world evolution, rather than merely mapping observations to actions. However, most end-to-end methods rely on direct state-to-action imitation, while existing world models often remain weakly aligned with downstream policy generation. We introduce Discrete-WAM, a unified discrete vision-action world-policy framework that represents visual observations, future states, high-level decisions, and ego actions within a shared token space. Built on this discrete alignment, Discrete-WAM jointly trains world modeling, world-policy modeling, and policy modeling through multi-task and multi-stage pretraining, allowing action-conditioned future prediction to directly support policy generation. For downstream planning, Discrete-WAM further decomposes policy generation into hierarchical decision prediction and parallel action-token editing, where the decision token provides a high-level planning skeleton and confidence-based scheduling refines dense future actions efficiently. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves strong planning performance while supporting controllable future generation, counterfactual evaluation, surprise-based world-model analysis, and efficient parallel policy decoding. These results suggest that discrete representation alignment, unified world-policy training, and hierarchical token editing provide a promising design paradigm for physical AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Discrete-WAM, a unified discrete vision-action world-policy framework for autonomous driving. It places visual observations, future states, high-level decisions, and ego actions in a shared discrete token space, jointly trains world modeling, world-policy modeling, and policy modeling via multi-task and multi-stage pretraining, and decomposes downstream planning into hierarchical decision prediction plus parallel action-token editing with confidence-based scheduling. Experiments on large-scale driving benchmarks are reported to show strong planning performance together with controllable future generation, counterfactual evaluation, surprise-based analysis, and efficient parallel decoding.

Significance. If the empirical results hold under the claimed discrete alignment, the work offers a concrete design paradigm that integrates world modeling and policy generation without separate alignment stages, potentially improving sample efficiency and controllability in physical AI systems. The explicit support for counterfactuals and surprise analysis is a notable strength beyond standard planning metrics.

major comments (1)
  1. [§3] The central claim that shared discrete token space plus multi-task pretraining allows action-conditioned future prediction to directly support policy generation without additional alignment mechanisms (abstract and §3) rests on an assumption whose load-bearing status is not fully tested; an ablation removing the joint training stages or the shared vocabulary would be required to isolate whether fidelity is preserved or whether implicit alignment emerges.
minor comments (2)
  1. [§3.2] Notation for the discrete token vocabulary and the hierarchical editing schedule should be introduced with explicit equations rather than prose descriptions to allow reproduction.
  2. [§4] The paper should report the exact token vocabulary size, codebook learning procedure, and any discretization hyperparameters in a dedicated table or subsection.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of the unified discrete vision-action framework. We address the single major comment below.

Point-by-point responses
  1. Referee: [§3] The central claim that shared discrete token space plus multi-task pretraining allows action-conditioned future prediction to directly support policy generation without additional alignment mechanisms (abstract and §3) rests on an assumption whose load-bearing status is not fully tested; an ablation removing the joint training stages or the shared vocabulary would be required to isolate whether fidelity is preserved or whether implicit alignment emerges.

    Authors: We agree that the load-bearing role of the joint multi-task and multi-stage pretraining, together with the shared discrete vocabulary, would be more clearly isolated by an explicit ablation that removes either the joint stages or the shared token space. The current multi-stage schedule already contains progressive stages (world modeling first, followed by joint world-policy training), and the reported planning and generation results are obtained under the full unified setting; however, these do not constitute the precise removal requested. We will therefore add the suggested ablation (separate training with and without shared vocabulary) to the revised manuscript to provide direct empirical evidence on whether the claimed direct support emerges from the joint training or from implicit alignment.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The abstract and available text describe an empirical framework (shared discrete token space, multi-task pretraining, hierarchical editing) with performance claims on driving benchmarks. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided material. The central claim is scoped to design and empirical results rather than a formal derivation that reduces to its inputs by construction. This matches the expected honest non-finding for papers without visible mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented physical entities are described in the abstract; the contribution is the proposed modeling framework itself.

pith-pipeline@v0.9.1-grok · 5780 in / 1130 out tokens · 38220 ms · 2026-06-28T01:47:21.595822+00:00 · methodology


Reference graph

Works this paper leans on

103 extracted references · 29 linked inside Pith

  1. [1]

    Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

  2. [2]

    Cosmos-reason1: From physical common sense to embodied reasoning

    Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  3. [3]

    Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint arXiv:2502.15672, 2025

    Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint arXiv:2502.15672, 2025

  4. [4]

    Accelerated sampling from masked diffusion models via entropy bounded unmasking

    Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, and Brian Karrer. Accelerated sampling from masked diffusion models via entropy bounded unmasking. InAdvancesin Neural Information Processing Systems, 2025

  5. [5]

    Motus: A unified latent action world model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 35101–35113, 2026

  6. [6]

    nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

  7. [7]

    Confidence-based decoding is provably efficient for diffusion language models.arXiv preprint arXiv:2603.22248, 2026

    Changxiao Cai and Gen Li. Confidence-based decoding is provably efficient for diffusion language models.arXiv preprint arXiv:2603.22248, 2026

  8. [8]

    Pseudo-simulation for autonomous driving

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, et al. Pseudo-simulation for autonomous driving. arXiv preprint arXiv:2506.04218, 2025

  9. [9]

    Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  10. [10]

    Last-r1: Reinforcing action via adaptive physical latent reasoning for vla models.arXiv preprint arXiv:2604.28192, 2026

    Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, et al. Last-r1: Reinforcing action via adaptive physical latent reasoning for vla models.arXiv preprint arXiv:2604.28192, 2026

  11. [11]

    End-to-end autonomous driving: Challenges and frontiers.IEEE Transactionson Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactionson Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

  12. [12]

    Milestones in autonomous driving and intelligent vehicles: Survey of surveys.IEEE Transactions on Intelligent Vehicles, 8(2):1046–1056, 2022

    Long Chen, Yuchen Li, Chao Huang, Bai Li, Yang Xing, Daxin Tian, Li Li, Zhongxu Hu, Xiaoxiang Na, Zixuan Li, et al. Milestones in autonomous driving and intelligent vehicles: Survey of surveys.IEEE Transactions on Intelligent Vehicles, 8(2):1046–1056, 2022

  13. [13]

    Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

  14. [14]

    Optimal inference schedules for masked diffusion models.arXiv preprint arXiv:2511.04647, 2025

    Sitan Chen, Kevin Cong, and Jerry Li. Optimal inference schedules for masked diffusion models.arXiv preprint arXiv:2511.04647, 2025

  15. [15]

    Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving

    Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 239–256. Springer, 2024

  16. [16]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

  17. [17]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024. 20

  18. [18]

    Duncan-Johnson and Emanuel Donchin

    Carolyn C. Duncan-Johnson and Emanuel Donchin. On quantifying surprise: The variation of event-related potentials with subjective probability.Psychophysiology, 14(5):456–467, 1977

  19. [19]

    Theoretical benefit and limitation of diffusion language model

    Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, and Di He. Theoretical benefit and limitation of diffusion language model. InAdvancesin Neural Information Processing Systems, 2025

  20. [20]

    Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning

    Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Xiangyu Li, Wenyu Liu, Qian Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning. Advancesin Neural Information Processing Systems, 38:32551–32576, 2026

  21. [21]

    Vista: A generalizable driving world model with high fidelity and versatile controllability.Advancesin Neural Information Processing Systems, 37:91560–91596, 2024

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.Advancesin Neural Information Processing Systems, 37:91560–91596, 2024

  22. [22]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  23. [23]

    World models for autonomous driving: An initial survey.IEEE Transactionson Intelligent Vehicles, 2024

    Yanchen Guan, Haicheng Liao, Zhenning Li, Jia Hu, Runze Yuan, Guohui Zhang, and Chengzhong Xu. World models for autonomous driving: An initial survey.IEEE Transactionson Intelligent Vehicles, 2024

  24. [24]

    ipad: Iterative proposal-centric end-to-end autonomous driving

    Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, and Chen Lv. ipad: Iterative proposal-centric end-to-end autonomous driving. arXiv preprint arXiv:2505.15111, 2025

  25. [25]

    The integration of prediction and planning in deep learning automated driving systems: A review.IEEE Transactionson Intelligent Vehicles, 10(5):3626–3643, 2024

    Steffen Hagedorn, Marcel Hallgarten, Martin Stoll, and Alexandru Paul Condurache. The integration of prediction and planning in deep learning automated driving systems: A review.IEEE Transactionson Intelligent Vehicles, 10(5):3626–3643, 2024

  26. [26]

    Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  27. [27]

    Drivingworld: Constructing world model for autonomous driving via video gpt.arXiv preprint arXiv:2412.19505, 2024

    Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Drivingworld: Constructing world model for autonomous driving via video gpt.arXiv preprint arXiv:2412.19505, 2024

  28. [28]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  29. [29]

    Automot: A unified vision-language-action model with asynchronous mixture-of-transformers for end-to-end autonomous driving.arXiv preprint arXiv:2603.14851, 2026

    Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, and Chen Lv. Automot: A unified vision-language-action model with asynchronous mixture-of-transformers for end-to-end autonomous driving.arXiv preprint arXiv:2603.14851, 2026

  30. [30]

    Mindvla-u1: Vla beats va with unified streaming architecture for autonomous driving

    Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, and Hongsheng Li. Mindvla-u1: Vla beats va with unified streaming architecture for autonomous driving. arXiv preprint arXiv:2605.12624, 2026

  31. [31]

    Gameformer: Game-theoretic modeling and learning of transformer- based interactive prediction and planning for autonomous driving

    Zhiyu Huang, Haochen Liu, and Chen Lv. Gameformer: Game-theoretic modeling and learning of transformer- based interactive prediction and planning for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3903–3913, 2023

  32. [32]

    Irl-vla: Training an vision-language-action policy via reward world model.arXiv preprint arXiv:2508.06571, 2025

    Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. Irl-vla: Training an vision-language-action policy via reward world model.arXiv preprint arXiv:2508.06571, 2025

  33. [33]

    A survey on vision-language-action models for autonomous driving

    Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025

  34. [34]

    Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.Authorea Preprints, 2025

    Peter Karkus, Maximilian Igl, Yuxiao Chen, Kashyap Chitta, Jef Packer, Bertrand Douillard, Ran Tian, Alexander Naumann, Guillermo Garcia-Cobo, Shuhan Tan, et al. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.Authorea Preprints, 2025

  35. [35]

    Kakade, and Sitan Chen

    Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. InProceedings of the 42nd International Conference 21 on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 30749–30768. PMLR, 2025

  36. [36]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

  37. [37]

    3d and 4d world modeling: A survey.arXiv preprint arXiv:2509.07996, 2025

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey.arXiv preprint arXiv:2509.07996, 2025

  38. [38]

    Error bounds and optimal schedules for masked diffusions with factorized approximations

    Hugo Lavenant and Giacomo Zanella. Error bounds and optimal schedules for masked diffusions with factorized approximations. arXiv preprint arXiv:2510.25544, 2025

  39. [39]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62, 2022

  40. [40]

    A convergence theory for diffusion language models: An information-theoretic perspective

    Gen Li and Changxiao Cai. A convergence theory for diffusion language models: An information-theoretic perspective. arXiv preprint arXiv:2505.21400, 2025

  41. [41]

    Discrete diffusion for reflective vision-language-action models in autonomous driving.arXiv preprint arXiv:2509.20109, 2025

    Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, and Xianpeng Lang. Discrete diffusion for reflective vision-language-action models in autonomous driving.arXiv preprint arXiv:2509.20109, 2025

  42. [42]

    Drivevla-w0: World models amplify data scaling law in autonomous driving

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025

  43. [43]

    End-to-end driving with online trajectory evaluation via bev world model

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025

  44. [44]

    Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

  45. [45]

    Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving

    Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, et al. Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving. arXiv preprint arXiv:2604.02190, 2026

  46. [46]

    Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  47. [47]

    Hydra-next: Robust closed-loop driving with open-loop training

    Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M Alvarez. Hydra-next: Robust closed-loop driving with open-loop training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27305–27314, 2025

  48. [48]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  49. [49]

    Model-based policy adaptation for closed-loop end-to-end autonomous driving

    Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, and Ding Zhao. Model-based policy adaptation for closed-loop end-to-end autonomous driving. InWorkshop on Foundation Models Meet Embodied Agents at CVPR 2025, 2025

  50. [50]

    Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model

    Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, et al. Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model. arXiv preprint arXiv:2512.11226, 2025

  51. [51]

    Reasoning multi-agent behavioral topology for interactive autonomous driving.Advancesin Neural Information Processing Systems, 37:92605–92637, 2024

    Haochen Liu, Li Chen, Yu Qiao, Chen Lv, and Hongyang Li. Reasoning multi-agent behavioral topology for interactive autonomous driving.Advancesin Neural Information Processing Systems, 37:92605–92637, 2024

  52. [52]

    Hybrid-prediction integrated planning for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2597–2614, 2025

    Haochen Liu, Zhiyu Huang, Wenhui Huang, Haohan Yang, Xiaoyu Mo, and Chen Lv. Hybrid-prediction integrated planning for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2597–2614, 2025. 22

  53. [53]

    Reinforced refinement with self-aware expansion for end-to-end autonomous driving.IEEE Transactionson Pattern Analysis and Machine Intelligence, 2026

    Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving.IEEE Transactionson Pattern Analysis and Machine Intelligence, 2026

  54. [54]

    Think while you generate: Discrete diffusion with planned denoising.arXivpreprintarXiv:2410.06264, 2024

    Sulin Liu, Juno Nam, Andrew Campbell, Hannes St"ark, Yilun Xu, Tommi Jaakkola, and Rafael G’omez- Bombarelli. Think while you generate: Discrete diffusion with planned denoising.arXivpreprintarXiv:2410.06264, 2024

  55. [55]

    Object-centric learning with slot attention.Advancesin neural information processing systems, 33:11525–11538, 2020

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention.Advancesin neural information processing systems, 33:11525–11538, 2020

  56. [56]

    Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation

    Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. InEuropean conference on computer vision, pages 329–345. Springer, 2024

  57. [57]

    Plan for speed: Dilated scheduling for masked diffusion language models

    Omer Luxembourg, Haim Permuter, and Eliya Nachmani. Plan for speed: Dilated scheduling for masked diffusion language models. arXiv preprint arXiv:2506.19037, 2025

  58. [58]

    Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

    Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

  59. [59]

    dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning

    Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, and Chaowei Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning. arXiv preprint arXiv:2512.04459, 2025

  60. [60]

    Jump your steps: Opti- mizing sampling schedule of discrete diffusion models

    Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, and Yuki Mitsufuji. Jump your steps: Opti- mizing sampling schedule of discrete diffusion models. InInternational Conference on Learning Representations, volume 2025, pages 96272–96300, 2025

  61. [61]

    Path planning for masked diffusion model sampling.arXiv preprint arXiv:2502.03540, 2025

    Fred Zhangzhi Peng, Zachary Bezemek, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Alexander Tong, and Pranam Chatterjee. Path planning for masked diffusion model sampling.arXiv preprint arXiv:2502.03540, 2025

  62. [62]

    Learn from your mistakes: Self-correcting masked diffusion models

    Yair Schiff, Omer Belhasin, Roy Uziel, Guanghan Wang, Marianne Arriola, Gilad Turok, Michael Elad, and Volodymyr Kuleshov. Learn from your mistakes: Self-correcting masked diffusion models. arXiv preprint arXiv:2602.11590, 2026

  63. [63]

    Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving

    Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhao-Xiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. Advances in Neural Information Processing Systems, 38:81565–81585, 2026

  64. [64]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  65. [65]

    Drivewam: Video generative priors enable scalable world-action modeling for autonomous driving

    Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, and Li Jiang. Drivewam: Video generative priors enable scalable world-action modeling for autonomous driving. arXiv preprint arXiv:2605.28544, 2026

  66. [66]

    Dinov3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  67. [67]

    Diver: Reinforced diffusion breaks imitation bottlenecks in end-to-end autonomous driving

    Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, and Yadan Luo. Diver: Reinforced diffusion breaks imitation bottlenecks in end-to-end autonomous driving. arXiv preprint arXiv:2507.04049, 2025

  68. [68]

    Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving

    Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, and Sifa Zheng. Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving. arXiv preprint arXiv:2603.29163, 2026

  69. [69]

    Simscale: Learning to drive via real-world simulation at scale

    Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, et al. Simscale: Learning to drive via real-world simulation at scale. arXiv preprint arXiv:2511.23369, 2025

  70. [70]

    Scene as occupancy

    Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023

  71. [71]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017

  72. [72]

    Reflectdrive-2: Reinforcement-learning-aligned self-editing for discrete diffusion driving

    Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, and Kun Zhan. Reflectdrive-2: Reinforcement-learning-aligned self-editing for discrete diffusion driving. arXiv preprint arXiv:2605.04647, 2026

  73. [73]

    Latent-wam: Latent world action modeling for end-to-end autonomous driving

    Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, et al. Latent-wam: Latent world action modeling for end-to-end autonomous driving. arXiv preprint arXiv:2603.24581, 2026

  74. [74]

    Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving

    Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, and Cheng Lu. Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving. arXiv preprint arXiv:2601.22032, 2026

  75. [75]

    Social interactions for autonomous driving: A review and perspectives

    Wenshuo Wang, Letian Wang, Chengyuan Zhang, Changliu Liu, and Lijun Sun. Social interactions for autonomous driving: A review and perspectives. Foundations and Trends® in Robotics, 10(3-4):198–377, 2022

  76. [76]

    Drivedreamer: Towards real-world-drive world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024

  77. [77]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024

  78. [78]

    Occllama: An occupancy-language-action generative world model for autonomous driving

    Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024

  79. [79]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

  80. [80]

    Drivelaw: Unifying planning and video generation in a latent driving world

    Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 39701–39712, 2026

Showing first 80 references.