arxiv: 2512.10226 · v2 · submitted 2025-12-11 · 💻 cs.CV · cs.RO

Latent Chain-of-Thought World Modeling for End-to-End Driving

Pith reviewed 2026-05-16 23:13 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords end-to-end driving · chain-of-thought reasoning · latent world model · reinforcement learning · vision-language-action · autonomous driving · trajectory prediction

The pith

LCDrive reasons about driving actions using latent tokens for proposals and future outcomes instead of text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LCDrive, a vision-language-action model for end-to-end driving that replaces natural language chain-of-thought with a latent language. It interleaves action-proposal tokens that share the model's output vocabulary and world-model tokens that express the future results of those actions. Training begins with supervision from ground-truth future scene rollouts to initialize the latent reasoning, followed by closed-loop reinforcement learning to refine it. On large-scale driving benchmarks this yields faster inference, higher-quality trajectories, and larger performance gains from interactive reinforcement learning than both non-reasoning baselines and text-based reasoning models.

Core claim

LCDrive unifies chain-of-thought reasoning and decision making by representing both in an action-aligned latent space: the model interleaves action-proposal tokens drawn from the same vocabulary as its output actions with world-model tokens grounded in a learned latent world model that expresses the future outcomes of the proposed actions.

What carries the argument

Interleaving of action-proposal tokens and world-model tokens in a learned latent space that directly captures action outcomes.
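
A minimal sketch of what such an interleaved reasoning trace could look like as a token layout. The class and function names, the per-step token counts, and the branch-major ordering are assumptions made for illustration; the source only states that action-proposal tokens and world-model tokens are interleaved, and the figure captions name a reasoning depth K and a branch factor B.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReasoningToken:
    kind: str    # "action" (proposal) or "world" (predicted outcome)
    branch: int  # index of the candidate plan, 0..B-1
    step: int    # reasoning step within the branch, 0..K-1
    slot: int    # position inside the token group for this step


def build_latent_cot_layout(depth_k: int, branches_b: int,
                            action_tokens_per_step: int,
                            world_tokens_per_step: int) -> List[ReasoningToken]:
    """Lay out an interleaved trace: for every branch and step, emit the
    action-proposal tokens, then the world-model tokens describing the
    outcome of that proposal. The ordering is hypothetical."""
    layout: List[ReasoningToken] = []
    for branch in range(branches_b):
        for step in range(depth_k):
            for slot in range(action_tokens_per_step):
                layout.append(ReasoningToken("action", branch, step, slot))
            for slot in range(world_tokens_per_step):
                layout.append(ReasoningToken("world", branch, step, slot))
    return layout


# Illustrative sizes only; the real per-step token counts are not given here.
layout = build_latent_cot_layout(depth_k=2, branches_b=2,
                                 action_tokens_per_step=1,
                                 world_tokens_per_step=3)
kinds = [token.kind for token in layout]
```

Under this layout the reasoning budget grows as K × B × (tokens per step), which is the kind of budget the efficiency study trades off against trajectory accuracy.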

If this is right

  • LCDrive runs inference faster than both non-reasoning and text-reasoning baselines.
  • It produces higher-quality driving trajectories on large-scale benchmarks.
  • It shows larger performance gains when post-trained with closed-loop reinforcement learning.
  • The latent representation supports unified reasoning and action selection for challenging driving scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-token approach could be tested on other sequential control tasks where text reasoning is slow or imprecise.
  • Extending the world-model tokens to predict uncertainty or rare events might further improve safety without added text overhead.
  • Combining this method with richer sensor inputs could test whether the latent space scales to more complex environments.

Load-bearing premise

The learned world-model tokens correctly express the actual future consequences of the actions the model proposes.

What would settle it

The claim would be undermined if the future scenes predicted by the world-model tokens diverge from the real futures observed when the vehicle executes the proposed actions in closed-loop tests.
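
One way such a check could be set up, sketched under stated assumptions: the world-model tokens are treated as continuous latent vectors, a hypothetical encoder has already mapped the futures actually observed in closed loop into the same latent space, and mean squared error serves as the divergence measure. None of these choices come from the paper, and the numbers are toy values.

```python
from typing import Sequence

Latent = Sequence[float]


def latent_divergence(predicted: Sequence[Latent],
                      observed: Sequence[Latent]) -> float:
    """Mean squared error between predicted world-model latents and the
    latents of the futures that actually occurred."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need the same non-zero number of latents")
    total, count = 0.0, 0
    for pred, obs in zip(predicted, observed):
        if len(pred) != len(obs):
            raise ValueError("latent dimensions must match")
        for p, o in zip(pred, obs):
            total += (p - o) ** 2
            count += 1
    return total / count


def transfer_gap(expert_pred, expert_obs, policy_pred, policy_obs) -> float:
    """Ratio of on-policy error to expert-rollout error. Values well above
    1.0 would suggest the world-model tokens degrade on the model's own
    action proposals."""
    expert_err = latent_divergence(expert_pred, expert_obs)
    policy_err = latent_divergence(policy_pred, policy_obs)
    return policy_err / max(expert_err, 1e-12)


# Toy latents: the on-policy predictions are made worse by construction.
observed = [[0.0, 0.0], [1.0, 1.0]]
expert_pred = [[0.1, 0.0], [1.0, 0.9]]
policy_pred = [[0.3, 0.0], [1.0, 0.7]]
gap = transfer_gap(expert_pred, observed, policy_pred, observed)
```

A ratio close to 1.0 would indicate that predictions on the model's own proposals are about as accurate as on expert rollouts.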

Figures

Figures reproduced from arXiv: 2512.10226 by Boris Ivanovic, Kashyap Chitta, Marco Pavone, Philipp Krahenbuhl, Ran Tian, Shuhan Tan, Wenjie Luo, Yan Wang, Yulong Cao, Yurong You, Yuxiao Chen.

Figure 1
Figure 1: Latent Chain-of-Thought Reasoning. Compared to text-based CoT, our proposed Latent CoT provides more efficient and aligned reasoning traces for end-to-end driving VLA models. …based chain-of-thought (CoT) before committing to actions [14, 24, 33, 34, 41]. While this is a natural choice following recent works on reasoning LLMs [36], a textual CoT presents several limitations when applied to driving. First, …
Figure 2
Figure 2: Architecture. Overview of our proposed latent reasoning framework. … E2E driving as modeling an autoregressive distribution over a token sequence that concatenates input information, (optional) reasoning trace, and the future trajectory of the ego vehicle τ …
Figure 3
Figure 3: Training strategy. We first use a base non-reasoning VLA to create latent CoT data, and cold start LCDrive by supervised learning. Then, we conduct reinforcement learning to activate useful reasoning capacity of LCDrive. … In this paper, we fix both K and B at training and evaluation for simplicity. Action Prediction. The complete reasoning context is REASON = …
Figure 4
Figure 4: Qualitative Results. Qualitative comparison of textual and latent reasoning in driving VLA models. Latent CoT captures fine-grained spatial relationships and multi-agent interactions while using a smaller inference budget, leading to more stable and accurate trajectory predictions. In each case, we highlight the main misalignment of the Text CoT reasoning with the final trajectory.
Figure 5
Figure 5: Efficiency Curve. We train different variants of LCDrive with different reasoning depth K and branch factor B. … C. Inference Efficiency Study. C.1. Ablation Study on Reasoning Depth. In this section, we study the trade-off between the reasoning token budget and trajectory accuracy by varying the reasoning depth K and branch factor B of LCDrive (GT LWM, Non-RL). For each variant, we construct the CoT supervi…
Original abstract

Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
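
The Figure 2 caption describes end-to-end driving as an autoregressive distribution over a token sequence that concatenates the input, an optional reasoning trace, and the ego trajectory τ. A hedged way to write that factorization is shown below; the symbols x (input tokens), r (reasoning trace), a and w (action-proposal and world-model token groups), M (number of proposal/outcome pairs), and θ (model parameters) are notation introduced here for illustration, not the paper's.

```latex
% Reasoning trace as interleaved action-proposal (a) and world-model (w) groups
r = \bigl(a^{(1)}, w^{(1)}, \dots, a^{(M)}, w^{(M)}\bigr)

% Autoregressive factorization over the reasoning trace and the trajectory
p_\theta(r, \tau \mid x)
  = \prod_{i=1}^{|r|} p_\theta\bigl(r_i \mid x, r_{<i}\bigr)
    \prod_{j=1}^{|\tau|} p_\theta\bigl(\tau_j \mid x, r, \tau_{<j}\bigr)
```

With an empty reasoning trace the first product is vacuous and the expression reduces to a non-reasoning model that predicts τ directly from x.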

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LCDrive, a Vision-Language-Action model for end-to-end driving that performs chain-of-thought reasoning in a latent action-aligned space. Reasoning interleaves action-proposal tokens (sharing vocabulary with output actions) and world-model tokens grounded in a learned latent world model that express future outcomes. The model is cold-started via supervision on ground-truth future rollouts and then post-trained with closed-loop reinforcement learning. The central claim is that LCDrive achieves faster inference, higher trajectory quality, and larger gains from interactive RL than non-reasoning and text-reasoning baselines on a large-scale driving benchmark.

Significance. If the empirical results hold, the work would demonstrate a concrete advantage for latent (rather than text) reasoning representations in safety-critical control tasks, with potential benefits for inference latency and alignment between reasoning and action outcomes. The combination of cold-start supervision and closed-loop RL is a standard recipe, but the specific latent tokenization could be a reusable idea for other VLA domains.

major comments (2)
  1. [Experiments / Results] The strongest claim—that latent CoT yields larger RL improvements than text-based or non-reasoning baselines—rests on the assumption that world-model tokens learned from expert rollouts remain accurate for the model's own on-policy action proposals. The manuscript provides no ablation or diagnostic (e.g., prediction error of world-model tokens on states visited during RL) that directly tests this transfer; without it the reported RL gains cannot be confidently attributed to the latent reasoning mechanism rather than other factors.
  2. [Experiments] The evaluation section does not report quantitative metrics, error bars, exact baseline implementations, or data-split details for the claimed improvements in inference speed and trajectory quality. These omissions make it impossible to assess effect sizes or reproducibility of the central performance claims.
minor comments (1)
  1. [Abstract] The abstract states performance improvements without any numerical values; a single sentence summarizing the magnitude of gains (e.g., “X% higher success rate, Y ms faster inference”) would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional diagnostics and details as described.

Point-by-point responses
  1. Referee: [Experiments / Results] The strongest claim—that latent CoT yields larger RL improvements than text-based or non-reasoning baselines—rests on the assumption that world-model tokens learned from expert rollouts remain accurate for the model's own on-policy action proposals. The manuscript provides no ablation or diagnostic (e.g., prediction error of world-model tokens on states visited during RL) that directly tests this transfer; without it the reported RL gains cannot be confidently attributed to the latent reasoning mechanism rather than other factors.

    Authors: We agree that a direct diagnostic would strengthen attribution of the RL gains specifically to the latent reasoning mechanism. The current results show larger RL improvements for LCDrive than baselines, but without an on-policy accuracy check this could partly reflect other factors. In the revision we will add an ablation measuring world-model token prediction error on states visited during closed-loop RL (comparing to the expert-rollout supervision used in cold-start), which will clarify the transfer and support the central claim. revision: yes

  2. Referee: [Experiments] The evaluation section does not report quantitative metrics, error bars, exact baseline implementations, or data-split details for the claimed improvements in inference speed and trajectory quality. These omissions make it impossible to assess effect sizes or reproducibility of the central performance claims.

    Authors: We acknowledge these omissions limit assessment of effect sizes and reproducibility. The revised manuscript will report the full quantitative metrics (including inference latency and trajectory quality scores), error bars computed over multiple random seeds, exact baseline implementations with hyperparameter details, and the precise data-split protocol used on the large-scale benchmark. revision: yes
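
The error bars promised in the second response could be computed along the following lines. This is a generic sketch using the mean and the standard error across seeds; the function name and the input values are placeholders, not results or conventions from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and the standard error of the mean across seeds."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


# Placeholder numbers, not results from the paper.
placeholder_metric_per_seed = [1.0, 2.0, 3.0]
mean, stderr = mean_and_stderr(placeholder_metric_per_seed)
```

Reporting the number of seeds alongside the mean and standard error lets a reader convert the error bar into a confidence interval.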

Circularity Check

0 steps flagged

No circularity: method uses external supervision and benchmark evaluation

Full rationale

The paper defines LCDrive via cold-start supervision of latent tokens on ground-truth future rollouts, followed by closed-loop RL post-training, with all performance claims resting on comparative results against non-reasoning and text-reasoning baselines on a large-scale external driving benchmark. No equations reduce a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from self-citations to force the architecture, and the latent world-model tokens are trained against observable rollouts rather than defined in terms of the final RL outcomes. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that a single learned latent space can simultaneously represent actionable proposals and accurate future scene outcomes; this is introduced without independent external validation in the abstract.

axioms (1)
  • domain assumption: A learned latent world model can ground reasoning tokens to express future outcomes of proposed actions.
    Invoked in the design of world model tokens and cold-start supervision from ground-truth rollouts.
invented entities (1)
  • Latent world model tokens (no independent evidence)
    purpose: Express future outcomes of actions within the shared latent space for reasoning.
    New representational element introduced to unify CoT and decision making; no independent falsifiable evidence outside the model is provided.

pith-pipeline@v0.9.0 · 5566 in / 1448 out tokens · 37933 ms · 2026-05-16T23:13:50.798754+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning

  • IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    interleaving (1) action-proposal tokens... and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  2. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 14 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 3

  2. [2]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 5

  3. [3]

    Unveiling the key factors for distilling chain-of-thought reasoning

    Xinghao Chen, Zhijing Sun, Guo Wenjin, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, et al. Unveiling the key factors for distilling chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15094–15119, 2025. 2

  4. [4]

    Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning

    Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025. 2

  5. [5]

    From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

    Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838, 2024. 2

  6. [6]

    Efficient reasoning models: A survey

    Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025. 2

  7. [7]

    ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755,

  8. [8]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018. 2

  9. [9]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024. 2

  10. [10]

    Model-based imitation learning for urban driving

    Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. Advances in Neural Information Processing Systems, 35:20703–20716, 2022. 2

  11. [11]

    UniAD: Unified perception and prediction for autonomous driving

    Hanxue Hu, Ye Yuan, Hongyang Xu, Zhaoyang Chen, Ming Liang, Zhiding Li, Yuexin Ma, Xiaodong Shen, Yuning Chai, Xiaoqing Tan, et al. UniAD: Unified perception and prediction for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 1, 2

  12. [12]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. ViPE: Video pose engine for 3D geometric perception. In NVIDIA Research Whitepapers arXiv:2508.10934, 2025. 9

  13. [13]

    Safedreamer: Safe reinforcement learning with world models

    Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. Safedreamer: Safe reinforcement learning with world models. arXiv preprint arXiv:2307.07176,

  14. [14]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024. 1, 2

  15. [15]

    Irl-vla: Training an vision-language-action policy via reward world model

    Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. Irl-vla: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571, 2025. 2

  16. [16]

    Reinforcement learning: A survey

    Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996. 2

  17. [17]

    Vision-language-action models for robotics: A review towards real-world applications

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 13:162467–162504, 2025. 1

  18. [18]

    OpenBox: Annotate any bounding boxes in 3d

    In-Jae Lee, Mungyeom Kim, Kwonyoung Ryu, Pierre Musacchio, and Jaesik Park. OpenBox: Annotate any bounding boxes in 3d. In Proceedings of the Int. Conf. on Neural Information Processing Systems (NeurIPS), 2025. 9

  19. [19]

    Latent Visual Reasoning

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. arXiv preprint arXiv:2509.24251, 2025. 2

  20. [20]

    Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2)

    Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). In European Conference on Computer Vision, pages 142–158. Springer, 2024. 2

  21. [21]

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 5

  22. [22]

    Dreamdrive: Generative 4d scene modeling from street view images

    Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 367–374. IEEE, 2025. 2, 9

  23. [23]

    Physical AI autonomous vehicles dataset

    NVIDIA. Physical AI autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025. 2, 5, 6, 7

  24. [24]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinge...

  25. [25]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023. 6

  26. [26]

    Better Call SAL: Towards learning to segment anything in lidar

    Aljosa Osep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, and Laura Leal-Taixé. Better Call SAL: Towards learning to segment anything in lidar. In European Conference on Computer Vision (ECCV), 2024. 9

  27. [27]

    Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models

    Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, et al. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models. arXiv preprint arXiv:2409.16663, 2024. 2

  28. [28]

    Qwen3-VL: Sharper vision, deeper thought, broader action

    Qwen Team. Qwen3-VL: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list, 2025. 3

  29. [29]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:24...

  30. [30]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, et al. Mastering Atari, Go, Chess and Shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019. 2

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 5

  32. [32]

    Latent chain-of-thought for visual reasoning

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025. 2

  33. [33]

    Tokenize the world into object-level knowledge to address long-tail events in autonomous driving

    Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, and Marco Pavone. Tokenize the world into object-level knowledge to address long-tail events in autonomous driving. In Conference on Robot Learning, 2024. 1, 2

  34. [34]

    DriveCoT: Integrating chain-of-thought reasoning with end-to-end driving

    Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, and Ping Luo. DriveCoT: Integrating chain-of-thought reasoning with end-to-end driving. arXiv preprint arXiv:2403.16996, 2024. 1, 2

  35. [35]

    Drivedreamer: Towards real-world-drive world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024. 2

  36. [36]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022. 1, 2

  37. [37]

    PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024. 1, 2

  38. [38]

    S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation

    Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, et al. S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1622–1632, 2025. 2

  39. [39]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang...

  40. [40]

    OpenDriveVLA: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, and Alois C Knoll. OpenDriveVLA: Towards end-to-end autonomous driving with large vision language action model. arXiv preprint arXiv:2503.23463, 2025. 2

  41. [41]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025. 1, 2

    A. Additional Implementation Details. A.1. Latent World Model Encoder. Our latent world model (...

  42. [42]

    This produces a sequence of per-agent, per-timestep features of shape R^{B×N×T×d_lwm}

    1) a learned timestep embedding added along the temporal axis; 2) an agent-type embedding (shared over timesteps) added per agent; 3) a stack of MLP residual blocks along the feature dimension. This produces a sequence of per-agent, per-timestep features of shape R^{B×N×T×d_lwm}. Temporal pooling per agent. To summarize the T=10 timesteps into a single feature p...

  43. [43]

    This means that even though the reasoning branches provide two candidate future plans, the decoder does not simply copy a branch

    Final actions improve upon the reasoning proposals. In both settings, we observe that Final-Action Quality < Reasoning Quality. This means that even though the reasoning branches provide two candidate future plans, the decoder does not simply copy a branch. Instead, it selects the more promising proposal and further refines it to produce a more accurate...

  44. [44]

    This shows that the proposal actions are actively used

    Strong alignment between reasoning proposals and the final action. Across both models, the Reasoning–Action Alignment score remains small, indicating that the final trajectory lies close to at least one of the proposal branches. This shows that the proposal actions are actively used. After RL, the alignment improves (0.614→0.581), indicating that RL stren...

  45. [45]

    This is essential in multi-agent driving scenarios with inherent uncertainty

    Reasoning branches maintain meaningful diversity. The Diversity score for both models indicates the two branches represent distinct motion hypotheses. This is essential in multi-agent driving scenarios with inherent uncertainty. RL slightly reduces diversity (0.412→0.353), but the branches remain significantly different. In other words, RL makes expl...

  46. [46]

    Introducing even a minimal amount of latent reasoning (e.g., K=1, B=2 with 24 tokens) produces a clear reduction in ADE

    Latent CoT provides consistent improvements over the baseline. The leftmost point corresponds to the non-reasoning model. Introducing even a minimal amount of latent reasoning (e.g., K=1, B=2 with 24 tokens) produces a clear reduction in ADE. This demonstrates that a small number of interleaved action-proposal and latent world-model tokens already provides ...

  47. [47]

    The largest gains are obtained when moving from shallow reasoning (e.g., K=1,2) to larger reasoning depth (K=3–5)

    Increasing reasoning budget yields meaningful gains. As we increase (K, B), performance improves smoothly, indicating that deeper latent reasoning enables the model to explore more steps into the future and produce better action plans based on that. The largest gains are obtained when moving from shallow reasoning (e.g., K=1,2) to larger reasoning depth (K...

  48. [48]

    Models with multiple branches (e.g., K=5, B=2) outperform the one with the same depth but fewer branches (e.g., K=5, B=1)

    Branching (B) leads to complementary improvements to depth (K). Branches encourage diverse counterfactual futures. Models with multiple branches (e.g., K=5, B=2) outperform the one with the same depth but fewer branches (e.g., K=5, B=1). This aligns with our diversity analysis: exploring alternative counterfactual futures provides richer reasoning sign...
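
The appendix fragments above mention ADE, a Reasoning–Action Alignment score, and a Diversity score, but the text reproduced here does not give their formulas. The sketch below shows one plausible reading. ADE follows its usual definition (mean Euclidean distance between corresponding waypoints); the `alignment` and `diversity` functions are hypothetical definitions chosen only to match the qualitative descriptions (distance of the final trajectory to its closest proposal, and distance between proposal branches), and the trajectories are toy values.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def ade(a: Trajectory, b: Trajectory) -> float:
    """Average displacement error: mean Euclidean distance between
    corresponding waypoints of two equal-length trajectories."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and equal length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def alignment(final: Trajectory, proposals: Sequence[Trajectory]) -> float:
    """Hypothetical alignment score: ADE from the final trajectory to its
    closest proposal branch (lower means closer)."""
    return min(ade(final, proposal) for proposal in proposals)


def diversity(proposals: Sequence[Trajectory]) -> float:
    """Hypothetical diversity score: mean pairwise ADE between branches."""
    pairs = [(i, j) for i in range(len(proposals))
             for j in range(i + 1, len(proposals))]
    if not pairs:
        raise ValueError("need at least two proposal branches")
    return sum(ade(proposals[i], proposals[j]) for i, j in pairs) / len(pairs)


# Toy trajectories, not data from the paper.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
veer = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
final = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)]
align_score = alignment(final, [straight, veer])
div_score = diversity([straight, veer])
```

Under these definitions a small alignment value together with a non-trivial diversity value would correspond to the pattern the fragments describe: the final action stays near one branch while the branches themselves remain distinct.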