Latent Chain-of-Thought World Modeling for End-to-End Driving

arxiv: 2512.10226 · v2 · submitted 2025-12-11 · 💻 cs.CV · cs.RO


Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic


Pith reviewed 2026-05-16 23:13 UTC · model grok-4.3


keywords: end-to-end driving · chain-of-thought reasoning · latent world model · reinforcement learning · vision-language-action · autonomous driving · trajectory prediction


The pith

LCDrive reasons about driving actions using latent tokens for proposals and future outcomes instead of text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LCDrive, a vision-language-action model for end-to-end driving that replaces natural language chain-of-thought with a latent language. It interleaves action-proposal tokens that share the model's output vocabulary and world-model tokens that express the future results of those actions. Training begins with supervision from ground-truth future scene rollouts to initialize the latent reasoning, followed by closed-loop reinforcement learning to refine it. On large-scale driving benchmarks this yields faster inference, higher-quality trajectories, and larger performance gains from interactive reinforcement learning than both non-reasoning baselines and text-based reasoning models.

Core claim

LCDrive unifies chain-of-thought reasoning and decision making by representing both in an action-aligned latent space: the model interleaves action-proposal tokens drawn from the same vocabulary as its output actions with world-model tokens grounded in a learned latent world model that expresses the future outcomes of the proposed actions.

What carries the argument

Interleaving of action-proposal tokens and world-model tokens in a learned latent space that directly captures action outcomes.
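
The interleaving can be pictured as a fixed token layout. The Python sketch below is illustrative only and is not taken from the paper: the `Token` record, the `latent_cot_layout` function, the step-major ordering of branches, and the 4/8 split between action-proposal and world-model tokens are all assumptions. The split is chosen only so that the total matches the 24-token budget quoted for K=1, B=2 in the appendix excerpt further down this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    kind: str    # "action" (action-proposal) or "world" (world-model)
    step: int    # reasoning depth index, 0..K-1
    branch: int  # branch index, 0..B-1
    slot: int    # position inside its group


def latent_cot_layout(K: int, B: int, n_action: int, n_world: int) -> List[Token]:
    """Lay out one reasoning trace: for every step and branch, a group of
    action-proposal tokens followed by the world-model tokens that describe
    the outcome of that proposal. Ordering and group sizes are hypothetical."""
    layout: List[Token] = []
    for step in range(K):
        for branch in range(B):
            layout += [Token("action", step, branch, i) for i in range(n_action)]
            layout += [Token("world", step, branch, i) for i in range(n_world)]
    return layout


# Hypothetical 4 + 8 split per branch; total budget is K * B * (n_action + n_world).
layout = latent_cot_layout(K=1, B=2, n_action=4, n_world=8)
print(len(layout))  # 24
```

Under this layout the reasoning budget grows linearly in both K and B, which is the quantity the efficiency study trades off against trajectory accuracy.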

If this is right

LCDrive runs inference faster than both non-reasoning and text-reasoning baselines.
It produces higher-quality driving trajectories on large-scale benchmarks.
It shows larger performance gains when post-trained with closed-loop reinforcement learning.
The latent representation supports unified reasoning and action selection for challenging driving scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-token approach could be tested on other sequential control tasks where text reasoning is slow or imprecise.
Extending the world-model tokens to predict uncertainty or rare events might further improve safety without added text overhead.
Combining this method with richer sensor inputs could test whether the latent space scales to more complex environments.

Load-bearing premise

The learned world-model tokens correctly express the actual future consequences of the actions the model proposes.

What would settle it

The premise would be undercut if the future scenes predicted by the world-model tokens diverge from the real futures observed when the vehicle executes the proposed actions in closed-loop tests.
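
One way such a check could be run is sketched below, under stated assumptions. It compares predicted future latents with the latents of the futures that were actually realized, using a mean L2 distance. The function names, the choice of L2 as the error measure, and the synthetic Gaussian data are all hypothetical; the paper does not specify this diagnostic, and the numbers produced here are not results.

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length latent vectors."""
    if len(a) != len(b):
        raise ValueError("latent vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def world_model_divergence(predicted, realized):
    """Mean L2 distance between predicted and realized future latents."""
    if len(predicted) != len(realized) or not predicted:
        raise ValueError("need matching, non-empty lists of latents")
    total = sum(l2_distance(p, r) for p, r in zip(predicted, realized))
    return total / len(predicted)


def noisy_copy(latents, scale, rng):
    """Synthetic stand-in for a world-model prediction: truth plus noise."""
    return [[x + rng.gauss(0.0, scale) for x in vec] for vec in latents]


# Synthetic data only: 64 latents of dimension 32 (both sizes are arbitrary).
rng = random.Random(0)
realized = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(64)]
expert_error = world_model_divergence(noisy_copy(realized, 0.05, rng), realized)
on_policy_error = world_model_divergence(noisy_copy(realized, 0.50, rng), realized)
print(f"expert-rollout error: {expert_error:.3f}")
print(f"on-policy error:      {on_policy_error:.3f}")
```

A large gap between the two errors on real data would suggest that the world-model tokens do not transfer from expert rollouts to the model's own proposals.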

Figures

Figures reproduced from arXiv: 2512.10226 by Boris Ivanovic, Kashyap Chitta, Marco Pavone, Philipp Krahenbuhl, Ran Tian, Shuhan Tan, Wenjie Luo, Yan Wang, Yulong Cao, Yurong You, Yuxiao Chen.

**Figure 1.** Latent Chain-of-Thought Reasoning. Compared to text-based CoT, our proposed Latent CoT provides more efficient and aligned reasoning traces for end-to-end driving VLA models.

**Figure 2.** Architecture. Overview of our proposed latent reasoning framework.

**Figure 3.** Training strategy. We first use a base non-reasoning VLA to create latent CoT data, and cold start LCDrive by supervised learning. Then, we conduct reinforcement learning to activate useful reasoning capacity of LCDrive. In this paper, we fix both K and B at training and evaluation for simplicity.

**Figure 4.** Qualitative Results. Qualitative comparison of textual and latent reasoning in driving VLA models. Latent CoT captures fine-grained spatial relationships and multi-agent interactions while using a smaller inference budget, leading to more stable and accurate trajectory predictions. In each case, we highlight the main misalignment of the Text CoT reasoning with the final trajectory.

**Figure 5.** Efficiency Curve. We train different variants of LCDrive with different reasoning depth K and branch factor B.

Original abstract

Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LCDrive's latent CoT with interleaved action and world-model tokens is a clear step past text-based VLA reasoning, but the abstract's missing numbers leave the size of the gains unclear.


The core idea here is to replace natural-language chain-of-thought with a latent space where the model interleaves action-proposal tokens (sharing the output vocabulary) and world-model tokens that are meant to express future outcomes. They cold-start the whole thing by supervising both on ground-truth future rollouts, then fine-tune with closed-loop RL. That setup is distinct from the text-CoT baselines in prior VLA driving work and could plausibly run faster at inference time while giving the model a grounded way to reason about its own proposed actions.

Referee Report

2 major / 1 minor

Summary. The paper introduces LCDrive, a Vision-Language-Action model for end-to-end driving that performs chain-of-thought reasoning in a latent action-aligned space. Reasoning interleaves action-proposal tokens (sharing vocabulary with output actions) and world-model tokens grounded in a learned latent world model that express future outcomes. The model is cold-started via supervision on ground-truth future rollouts and then post-trained with closed-loop reinforcement learning. The central claim is that LCDrive achieves faster inference, higher trajectory quality, and larger gains from interactive RL than non-reasoning and text-reasoning baselines on a large-scale driving benchmark.

Significance. If the empirical results hold, the work would demonstrate a concrete advantage for latent (rather than text) reasoning representations in safety-critical control tasks, with potential benefits for inference latency and alignment between reasoning and action outcomes. The combination of cold-start supervision and closed-loop RL is a standard recipe, but the specific latent tokenization could be a reusable idea for other VLA domains.

major comments (2)

[Experiments / Results] The strongest claim—that latent CoT yields larger RL improvements than text-based or non-reasoning baselines—rests on the assumption that world-model tokens learned from expert rollouts remain accurate for the model's own on-policy action proposals. The manuscript provides no ablation or diagnostic (e.g., prediction error of world-model tokens on states visited during RL) that directly tests this transfer; without it the reported RL gains cannot be confidently attributed to the latent reasoning mechanism rather than other factors.
[Experiments] The evaluation section does not report quantitative metrics, error bars, exact baseline implementations, or data-split details for the claimed improvements in inference speed and trajectory quality. These omissions make it impossible to assess effect sizes or reproducibility of the central performance claims.

minor comments (1)

[Abstract] The abstract states performance improvements without any numerical values; a single sentence summarizing the magnitude of gains (e.g., “X% higher success rate, Y ms faster inference”) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional diagnostics and details as described.


Referee: [Experiments / Results] The strongest claim—that latent CoT yields larger RL improvements than text-based or non-reasoning baselines—rests on the assumption that world-model tokens learned from expert rollouts remain accurate for the model's own on-policy action proposals. The manuscript provides no ablation or diagnostic (e.g., prediction error of world-model tokens on states visited during RL) that directly tests this transfer; without it the reported RL gains cannot be confidently attributed to the latent reasoning mechanism rather than other factors.

Authors: We agree that a direct diagnostic would strengthen attribution of the RL gains specifically to the latent reasoning mechanism. The current results show larger RL improvements for LCDrive than baselines, but without an on-policy accuracy check this could partly reflect other factors. In the revision we will add an ablation measuring world-model token prediction error on states visited during closed-loop RL (comparing to the expert-rollout supervision used in cold-start), which will clarify the transfer and support the central claim. revision: yes
Referee: [Experiments] The evaluation section does not report quantitative metrics, error bars, exact baseline implementations, or data-split details for the claimed improvements in inference speed and trajectory quality. These omissions make it impossible to assess effect sizes or reproducibility of the central performance claims.

Authors: We acknowledge these omissions limit assessment of effect sizes and reproducibility. The revised manuscript will report the full quantitative metrics (including inference latency and trajectory quality scores), error bars computed over multiple random seeds, exact baseline implementations with hyperparameter details, and the precise data-split protocol used on the large-scale benchmark. revision: yes
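
The kind of reporting promised here can be sketched briefly. The example below computes average displacement error (ADE, the mean Euclidean distance between predicted and ground-truth waypoints) per seed and summarizes it as mean plus or minus standard error. The helper names and the toy trajectories are hypothetical placeholders, not values from the paper, and the paper's exact metric definitions may differ.

```python
import math
import statistics


def ade(pred, gt):
    """Average displacement error between matched 2D waypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def mean_and_stderr(values):
    """Mean and standard error of the mean (0.0 for a single value)."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))


# Toy data: one ground-truth path and one prediction per random seed.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
per_seed_predictions = {
    0: [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)],
    1: [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
    2: [(0.0, 1.5), (1.0, 1.5), (2.0, 1.5)],
}
per_seed_ade = [ade(p, ground_truth) for p in per_seed_predictions.values()]
mean, stderr = mean_and_stderr(per_seed_ade)
print(f"ADE = {mean:.3f} +/- {stderr:.3f}")  # ADE = 1.000 +/- 0.289
```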

Circularity Check

0 steps flagged

No circularity: method uses external supervision and benchmark evaluation


The paper defines LCDrive via cold-start supervision of latent tokens on ground-truth future rollouts, followed by closed-loop RL post-training, with all performance claims resting on comparative results against non-reasoning and text-reasoning baselines on a large-scale external driving benchmark. No equations reduce a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from self-citations to force the architecture, and the latent world-model tokens are trained against observable rollouts rather than defined in terms of the final RL outcomes. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that a single learned latent space can simultaneously represent actionable proposals and accurate future scene outcomes; this is introduced without independent external validation in the abstract.

axioms (1)

Domain assumption: A learned latent world model can ground reasoning tokens to express future outcomes of proposed actions.
Invoked in the design of world model tokens and cold-start supervision from ground-truth rollouts.

invented entities (1)

Latent world model tokens (no independent evidence)
purpose: Express future outcomes of actions within the shared latent space for reasoning.
New representational element introduced to unify CoT and decision making; no independent falsifiable evidence outside the model is provided.

pith-pipeline@v0.9.0 · 5566 in / 1448 out tokens · 37933 ms · 2026-05-16T23:13:50.798754+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning"

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "interleaving (1) action-proposal tokens... and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EponaV2: Driving World Model with Comprehensive Future Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV 2026-04 unverdicted novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[3] Xinghao Chen, Zhijing Sun, Guo Wenjin, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, et al. Unveiling the key factors for distilling chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15094–15119, 2025.
[4] Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025.
[5] Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit CoT to implicit CoT: Learning to internalize CoT step by step. arXiv preprint arXiv:2405.14838, 2024.
[6] Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025.
[7] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755.
[8] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018.
[9] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
[10] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. Advances in Neural Information Processing Systems, 35:20703–20716, 2022.
[11] Hanxue Hu, Ye Yuan, Hongyang Xu, Zhaoyang Chen, Ming Liang, Zhiding Li, Yuexin Ma, Xiaodong Shen, Yuning Chai, Xiaoqing Tan, et al. UniAD: Unified perception and prediction for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[12] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. ViPE: Video pose engine for 3D geometric perception. In NVIDIA Research Whitepapers arXiv:2508.10934, 2025.
[13] Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. Safedreamer: Safe reinforcement learning with world models. arXiv preprint arXiv:2307.07176.
[14] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.
[15] Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. Irl-vla: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571, 2025.
[16] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
[17] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 13:162467–162504, 2025.
[18] In-Jae Lee, Mungyeom Kim, Kwonyoung Ryu, Pierre Musacchio, and Jaesik Park. OpenBox: Annotate any bounding boxes in 3d. In Proceedings of the Int. Conf. on Neural Information Processing Systems (NeurIPS), 2025.
[19] Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. arXiv preprint arXiv:2509.24251, 2025.
[20] Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). In European Conference on Computer Vision, pages 142–158. Springer, 2024.
[21] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[22] Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 367–374. IEEE, 2025.
[23] NVIDIA. Physical AI autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025.
[24] NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinge... Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail. arXiv, 2025.
[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023.
[26] Aljosa Osep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, and Laura Leal-Taixé. Better Call SAL: Towards learning to segment anything in lidar. In European Conference on Computer Vision (ECCV), 2024.
[27] Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, et al. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models. arXiv preprint arXiv:2409.16663, 2024.
[28] Qwen Team. Qwen3-VL: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list, 2025.
[29] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:24...
[30] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, et al. Mastering Atari, Go, Chess and Shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[32] Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025.
[33] Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, and Marco Pavone. Tokenize the world into object-level knowledge to address long-tail events in autonomous driving. In Conference on Robot Learning, 2024.
[34] Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, and Ping Luo. DriveCoT: Integrating chain-of-thought reasoning with end-to-end driving. arXiv preprint arXiv:2403.16996, 2024.
[35] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024.
[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.
[37] Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024.
[38] Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, et al. S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1622–1632, 2025.
[39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang... Qwen3 Technical Report. arXiv, 2025.
[40] Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, and Alois C Knoll. OpenDriveVLA: Towards end-to-end autonomous driving with large vision language action model. arXiv preprint arXiv:2503.23463, 2025.
[41] Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025.

A. Additional Implementation Details

A.1. Latent World Model Encoder

Our latent world model (...
[42]

This produces a sequence of per- agent, per-timestep features of shapeR B×N×T×d lwm

a learned timestep embedding added along the temporal axis; 2) an agent-type embedding (shared over timesteps) added per agent; 3) a stack of MLP residual blocks along the feature dimension. This produces a sequence of per- agent, per-timestep features of shapeR B×N×T×d lwm. Temporal pooling per agent.To summarize theT=10 timesteps into a single feature p...

Final actions improve upon the reasoning proposals. In both settings, we observe that Final-Action Quality < Reasoning Quality. This means that even though the reasoning branches provide two candidate future plans, the decoder does not simply copy a branch. Instead, it selects the more promising proposal and further refines it to produce a more accurate...

Strong alignment between reasoning proposals and the final action. Across both models, the Reasoning–Action Alignment score remains small, indicating that the final trajectory lies close to at least one of the proposal branches. This shows that the proposal actions are actively used. After RL, the alignment improves (0.614→0.581), indicating that RL stren...
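One plausible reading of "close to at least one of the proposal branches" is a minimum distance over branches. The sketch below illustrates that reading only; the exact metric definition is not given in this excerpt, so the function names `ade` and `reasoning_action_alignment`, the use of average displacement error as the distance, and the 2D waypoint format are all assumptions.

```python
import math


def ade(traj_a, traj_b):
    """Average displacement error between two equal-length lists of (x, y) waypoints."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def reasoning_action_alignment(final_traj, proposal_trajs):
    """ASSUMED definition: distance from the final trajectory to its nearest proposal."""
    return min(ade(final_traj, proposal) for proposal in proposal_trajs)


# Hypothetical example: the final plan coincides with the first of two branches.
final = [(0.0, 0.0), (1.0, 0.0)]
branches = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 3.0), (1.0, 4.0)]]
score = reasoning_action_alignment(final, branches)
```

Under this definition a lower score means the final action stays closer to some proposal, which matches the direction of the reported change.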

Reasoning branches maintain meaningful diversity. The Diversity score for both models indicates the two branches represent distinct motion hypotheses. This is essential in multi-agent driving scenarios with inherent uncertainty. RL slightly reduces diversity (0.412→0.353), but the branches remain significantly different. In other words, RL makes expl...
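A diversity score of this kind could be computed as the average pairwise distance between proposal branches. The following is a sketch under that assumption, not the paper's definition, which this excerpt does not state; `ade` and `branch_diversity` are hypothetical names, and waypoints are assumed to be 2D.

```python
import math
from itertools import combinations


def ade(traj_a, traj_b):
    """Average displacement error between two equal-length lists of (x, y) waypoints."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def branch_diversity(proposal_trajs):
    """ASSUMED definition: mean pairwise ADE across all proposal branches."""
    pairs = list(combinations(proposal_trajs, 2))
    if not pairs:
        raise ValueError("need at least two branches to measure diversity")
    return sum(ade(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example with two branches (B=2): a single pair is compared.
branches = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 3.0), (1.0, 4.0)]]
diversity = branch_diversity(branches)
```

With two branches this reduces to the distance between them; a score of zero would mean the branches have collapsed to the same plan.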

Latent CoT provides consistent improvements over the baseline. The leftmost point corresponds to the non-reasoning model. Introducing even a minimal amount of latent reasoning (e.g., K=1, B=2 with 24 tokens) produces a clear reduction in ADE. This demonstrates that a small number of interleaved action-proposal and latent world-model tokens already provides ...

Increasing reasoning budget yields meaningful gains. As we increase (K, B), performance improves smoothly, indicating that deeper latent reasoning enables the model to explore more steps into the future and produce better action plans based on that. The largest gains are obtained when moving from shallow reasoning (e.g., K=1,2) to larger reasoning depth (K...

Branching (B) leads to complementary improvements to depth (K). Branches encourage diverse counterfactual futures. Models with multiple branches (e.g., K=5, B=2) outperform the one with the same depth but fewer branches (e.g., K=5, B=1). This aligns with our diversity analysis: exploring alternative counterfactual futures provides richer reasoning sign...


[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

[3] Xinghao Chen, Zhijing Sun, Guo Wenjin, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, et al. Unveiling the key factors for distilling chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15094–15119, 2025.

[4] Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025.

[5] Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit CoT to implicit CoT: Learning to internalize CoT step by step. arXiv preprint arXiv:2405.14838, 2024.

[6] Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025.

[7] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755.

[8] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018.

[9] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.

[10] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. Advances in Neural Information Processing Systems, 35:20703–20716, 2022.

[11] Hanxue Hu, Ye Yuan, Hongyang Xu, Zhaoyang Chen, Ming Liang, Zhiding Li, Yuexin Ma, Xiaodong Shen, Yuning Chai, Xiaoqing Tan, et al. UniAD: Unified perception and prediction for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[12] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. ViPE: Video pose engine for 3D geometric perception. In NVIDIA Research Whitepapers arXiv:2508.10934, 2025.

[13] Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. Safedreamer: Safe reinforcement learning with world models. arXiv preprint arXiv:2307.07176.

[14] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.

[15] Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. IRL-VLA: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571, 2025.

[16] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.

[17] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 13:162467–162504, 2025.

[18] In-Jae Lee, Mungyeom Kim, Kwonyoung Ryu, Pierre Musacchio, and Jaesik Park. OpenBox: Annotate any bounding boxes in 3d. In Proceedings of the Int. Conf. on Neural Information Processing Systems (NeurIPS), 2025.

[19] Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. arXiv preprint arXiv:2509.24251, 2025.

[20] Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). In European Conference on Computer Vision, pages 142–158. Springer, 2024.

[21] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[22] Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 367–374. IEEE, 2025.

[23] NVIDIA. Physical AI autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025.

[24] NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinge... Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail. arXiv, 2025.

[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023.

[26] Aljosa Osep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, and Laura Leal-Taixé. Better Call SAL: Towards learning to segment anything in lidar. In European Conference on Computer Vision (ECCV), 2024.

[27] Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, et al. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models. arXiv preprint arXiv:2409.16663, 2024.

[28] Qwen Team. Qwen3-VL: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list, 2025.

[29] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:24...

[30] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, et al. Mastering Atari, Go, Chess and Shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[32] Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025.

[33] Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, and Marco Pavone. Tokenize the world into object-level knowledge to address long-tail events in autonomous driving. In Conference on Robot Learning, 2024.

[34] Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, and Ping Luo. DriveCoT: Integrating chain-of-thought reasoning with end-to-end driving. arXiv preprint arXiv:2403.16996, 2024.

[35] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024.

[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.

[37] Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024.
