Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning
Pith reviewed 2026-06-28 01:47 UTC · model grok-4.3
The pith
A shared discrete token space for observations, states, decisions and actions lets world prediction directly generate driving policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Discrete-WAM places visual observations, future states, high-level decisions and ego actions inside a single discrete token space, then jointly trains world modeling, world-policy modeling and policy modeling through multi-task and multi-stage pretraining. This alignment makes action-conditioned future prediction serve policy generation without extra mechanisms. Downstream planning decomposes into hierarchical decision prediction followed by confidence-based parallel action-token editing that produces dense future actions efficiently.
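One way to picture a shared discrete token space is a single vocabulary in which each modality owns its own contiguous range of ids. The sketch below is illustrative only and is not taken from the paper: the modality names, the codebook sizes and the helpers `build_offsets`, `to_shared` and `from_shared` are all assumptions. Future states are assumed to reuse the vision range, since they are predicted observations.

```python
# Hypothetical per-modality codebook sizes; the paper's real sizes are not given here.
MODALITY_SIZES = {"vision": 8192, "decision": 16, "action": 256}


def build_offsets(sizes):
    """Give each modality a contiguous, non-overlapping range of token ids."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = (start, start + size)
        start += size
    return offsets, start  # start now equals the total vocabulary size


OFFSETS, VOCAB_SIZE = build_offsets(MODALITY_SIZES)


def to_shared(modality, local_id):
    """Map a modality-local code index to its id in the shared vocabulary."""
    lo, hi = OFFSETS[modality]
    if not 0 <= local_id < hi - lo:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return lo + local_id


def from_shared(token_id):
    """Recover (modality, local index) from a shared-vocabulary id."""
    for name, (lo, hi) in OFFSETS.items():
        if lo <= token_id < hi:
            return name, token_id - lo
    raise ValueError(f"{token_id} is outside the shared vocabulary")
```

With a layout like this, one sequence model can read and emit vision, decision and action tokens through the same embedding table and output head, which is the property the core claim depends on.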
What carries the argument
The shared discrete token space that unifies visual observations, future states, high-level decisions and ego actions so that action-conditioned prediction directly supplies policy tokens.
If this is right
- Action-conditioned future prediction directly supports policy generation on large-scale driving benchmarks.
- The same model enables controllable future generation and counterfactual evaluation.
- Surprise-based analysis of the world model becomes possible from the same token predictions.
- Policy decoding runs in parallel through hierarchical decision tokens and confidence-based action editing.
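The parallel decoding in the last bullet can be sketched as a confidence-ordered unmasking loop. This is a generic masked-decoding scheme, not the paper's algorithm: the `predict` callback, the `MASK` sentinel and the even split of positions across rounds are assumptions, and the paper's editing schedule may also revise tokens that were already committed. A decision token would enter as conditioning inside `predict`.

```python
MASK = -1  # hypothetical sentinel for an action slot that is not yet committed


def confidence_decode(predict, length, steps):
    """Fill `length` action slots in at most `steps` parallel rounds.

    `predict(tokens)` is an assumed callback that returns one
    (token, confidence) pair per position, given the current partial sequence.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = predict(tokens)
        # Spread the remaining slots evenly over the remaining rounds (ceil division).
        k = -(-len(masked) // (steps - step))
        # Commit only the k most confident proposals this round.
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens
```

Each round costs one forward pass regardless of how many tokens it commits, so a trajectory of `length` tokens needs `steps` passes instead of `length` passes as in left-to-right decoding.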
Where Pith is reading between the lines
- The same token-editing approach could transfer to other embodied tasks where future prediction must stay tightly coupled to control.
- Removing the need for separate world-model and policy heads might simplify training loops in physical agents.
- Token-level editing could make it easier to inspect or intervene on specific parts of a planned trajectory.
Load-bearing premise
Putting visual observations, future states, decisions and actions into one shared discrete token space will preserve enough fidelity for action-conditioned prediction to support policy generation without further alignment steps.
What would settle it
On the same large-scale autonomous-driving benchmarks, a version that keeps separate continuous or non-shared token spaces for vision and action produces equal or better planning metrics than the unified discrete version while using comparable compute.
Autonomous driving requires reasoning about how ego actions shape future world evolution, rather than merely mapping observations to actions. However, most end-to-end methods rely on direct state-to-action imitation, while existing world models often remain weakly aligned with downstream policy generation. We introduce Discrete-WAM, a unified discrete vision-action world-policy framework that represents visual observations, future states, high-level decisions, and ego actions within a shared token space. Built on this discrete alignment, Discrete-WAM jointly trains world modeling, world-policy modeling, and policy modeling through multi-task and multi-stage pretraining, allowing action-conditioned future prediction to directly support policy generation. For downstream planning, Discrete-WAM further decomposes policy generation into hierarchical decision prediction and parallel action-token editing, where the decision token provides a high-level planning skeleton and confidence-based scheduling refines dense future actions efficiently. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves strong planning performance while supporting controllable future generation, counterfactual evaluation, surprise-based world-model analysis, and efficient parallel policy decoding. These results suggest that discrete representation alignment, unified world-policy training, and hierarchical token editing provide a promising design paradigm for physical AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Discrete-WAM, a unified discrete vision-action world-policy framework for autonomous driving. It places visual observations, future states, high-level decisions, and ego actions in a shared discrete token space, jointly trains world modeling, world-policy modeling, and policy modeling via multi-task and multi-stage pretraining, and decomposes downstream planning into hierarchical decision prediction plus parallel action-token editing with confidence-based scheduling. Experiments on large-scale driving benchmarks are reported to show strong planning performance together with controllable future generation, counterfactual evaluation, surprise-based analysis, and efficient parallel decoding.
Significance. If the empirical results hold under the claimed discrete alignment, the work offers a concrete design paradigm that integrates world modeling and policy generation without separate alignment stages, potentially improving sample efficiency and controllability in physical AI systems. The explicit support for counterfactuals and surprise analysis is a notable strength beyond standard planning metrics.
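The report does not say how the paper defines surprise. A common choice, assumed here, is the negative log-probability (Shannon surprisal) that the world model assigned to each token that was actually observed. The function names and the input format below are hypothetical.

```python
import math


def token_surprise(probs, observed, eps=1e-12):
    """Per-token surprisal in nats.

    `probs[t]` is the model's predicted distribution over the vocabulary at
    position t, and `observed[t]` is the token id that actually occurred.
    """
    return [-math.log(max(p[o], eps)) for p, o in zip(probs, observed)]


def mean_surprise(probs, observed):
    """Average surprisal over a frame or a trajectory."""
    values = token_surprise(probs, observed)
    return sum(values) / len(values)
```

Under this assumption, high values mark positions where the observed future differed from what the model expected, and they come from the same token predictions used for planning.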
major comments (1)
- [§3] The central claim (abstract and §3) is that a shared discrete token space plus multi-task pretraining lets action-conditioned future prediction directly support policy generation without additional alignment mechanisms. This rests on an assumption whose load-bearing status is not fully tested; an ablation removing the joint training stages or the shared vocabulary is needed to isolate whether fidelity is preserved or whether implicit alignment emerges.
minor comments (2)
- [§3.2] Notation for the discrete token vocabulary and the hierarchical editing schedule should be introduced with explicit equations rather than prose descriptions to allow reproduction.
- [§4] The paper should report the exact token vocabulary size, codebook learning procedure, and any discretization hyperparameters in a dedicated table or subsection.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of the unified discrete vision-action framework. We address the single major comment below.
- Referee: [§3] The central claim (abstract and §3) is that a shared discrete token space plus multi-task pretraining lets action-conditioned future prediction directly support policy generation without additional alignment mechanisms. This rests on an assumption whose load-bearing status is not fully tested; an ablation removing the joint training stages or the shared vocabulary is needed to isolate whether fidelity is preserved or whether implicit alignment emerges.
Authors: We agree that the load-bearing role of the joint multi-task and multi-stage pretraining, together with the shared discrete vocabulary, would be more clearly isolated by an explicit ablation that removes either the joint stages or the shared token space. The current multi-stage schedule already contains progressive stages (world modeling first, followed by joint world-policy training), and the reported planning and generation results are obtained under the full unified setting; however, these do not constitute the precise removal requested. We will therefore add the suggested ablation (separate training with and without shared vocabulary) to the revised manuscript to provide direct empirical evidence on whether the claimed direct support emerges from the joint training or from implicit alignment.
Revision: yes
Circularity Check
No significant circularity
The abstract and available text describe an empirical framework (shared discrete token space, multi-task pretraining, hierarchical editing) with performance claims on driving benchmarks. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided material. The central claim is scoped to design and empirical results rather than a formal derivation that reduces to its inputs by construction. This matches the expected honest non-finding for papers without visible mathematical structure.