pith. machine review for the scientific record. sign in

arxiv: 2605.12624 · v2 · submitted 2026-05-12 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links

· Lean Theorem

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:03 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords autonomous drivingvision-language-actionstreaming architectureflow matchingclassifier-free guidanceend-to-end planningunified multimodal model
0
0 comments X

The pith

MindVLA-U1 unifies language and continuous action in one streaming pass to surpass human drivers on long-tail driving benchmarks while matching vision-action latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language-action models have trailed simpler vision-action systems in driving not because of scale but because of how they were assembled as disconnected subtask improvements. It introduces a single VLM backbone that produces both autoregressive language tokens and flow-matching action trajectories over one shared representation, processing video frames continuously rather than in fixed chunks. A streaming memory channel carries temporal context across frames so trajectories evolve smoothly, and language-predicted driving intents steer the action diffusion process through classifier-free guidance. This setup matters because it preserves natural language interfaces for interaction while delivering planning quality that exceeds experienced human drivers on the WOD-E2E benchmark. If the unification works as described, it shows semantic reasoning and continuous control can be combined without paying the usual speed or coherence penalty.

Core claim

MindVLA-U1 uses a unified VLM backbone to output autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation. The architecture processes driving video framewise with a learned streaming memory channel that updates temporal context, allowing planned trajectories to evolve smoothly from frame to frame. Language-predicted driving intents steer the action diffusion via classifier-free guidance, turning semantic outputs directly into control signals. On the long-tail WOD-E2E benchmark the model reaches 8.20 RFS versus 8.13 for ground-truth human drivers, records state-of-the-art planning average displacement errors over先行

What carries the argument

Unified VLM backbone that produces both AR language tokens and flow-matching action trajectories in one shared representation, paired with a streaming memory channel and classifier-free guidance from language intents to action diffusion.

If this is right

  • Surpasses experienced human drivers for the first time on long-tail scenarios with only two diffusion steps
  • Achieves state-of-the-art planning average displacement errors over prior vision-action and vision-language-action models
  • Matches vision-action latency at 16 FPS for a 1B-scale model
  • Enables flexible fast and slow reasoning modes through self-attention context management on dense and sparse backbones
  • Exposes a direct measurable path where language-predicted intents steer continuous action planning

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The streaming memory channel could allow real-time adaptation to changing traffic without re-processing entire video clips
  • Language guidance via classifier-free guidance might support on-the-fly style changes such as more cautious or efficient driving from high-level commands
  • The single-pass unification could reduce system complexity in other sequential control domains like robotic manipulation
  • Extending the same framewise design to multi-camera inputs would test whether the coherence gains hold under wider field-of-view sensing

Load-bearing premise

That combining language and action outputs in one shared representation with streaming memory and classifier-free guidance will compose coherent driving behavior without the fragmentation of prior isolated subtask models.

What would settle it

On the WOD-E2E benchmark or an equivalent long-tail driving set, MindVLA-U1 fails to exceed the 8.13 human RFS baseline or shows planning ADEs no better than leading vision-action models at comparable latency.

Figures

Figures reproduced from arXiv: 2605.12624 by Benjin Zhu, Haiming Zhang, Hengtong Lu, Hongsheng Li, Jifeng Dai, Victor Shea-Jay Huang, Wei Chen, Yan Xie, Yuzhou Huang.

Figure 1
Figure 1. Figure 1: AD capability radar Driving, at its core, is two things at once — a continuous act of physical control, and a contin￾uous act of understanding. Most of it happens by reflex: the routine lane changes, the gentle braking, the thousand small adjustments that a skilled driver makes without thinking. But the moments that separate competent driving from merely adequate driving are the moments when reflex is not … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of MindVLA-U1. Vision, ego-state, language, memory, and noisy action tokens flow through a shared VLM backbone in one forward pass; the LM head and the flow￾matching action head read out at their respective token positions (§2.1). A FIFO memory channel propagates compact temporal context across frames, motion-aligned on read and refreshed after each pass (§2.2). Attention-mask composition exposes … view at source ↗
Figure 3
Figure 3. Figure 3: Fast/Slow systems on Sparse MoT. Each layer splits into two parallel expert groups — context (V, L) and action (M, S, A) — joined by a shared self-attention pool so every query sees both groups. Per-modality Q/K/V/O pro￾jections feed the shared SA; per-functionality FFN experts (ctx, act, plus extension slots: reason, safety) decode after it. Fast mode (action_only) physically excludes language to￾kens fro… view at source ↗
Figure 5
Figure 5. Figure 5: Intent-CFG as a structural multi-modality mechanism. Per-intent trajectories on one WOD-E2E frame; left of each panel: BEV overview with GT (green); right: per-intent subplots. (a) uses the 3-class GT intent; (b) uses MindLabel’s 20-class extension on the same checkpoint. 3.4 Fast/Slow Execution and MoT Design MindVLA-U1’s unified backbone supports sparse MoT routing for fast/slow execution (§2.1). We eval… view at source ↗
Figure 6
Figure 6. Figure 6: Flow-matching denoising over 5 Euler steps on one WOD-E2E frame (BEV: ego forward +X, lateral +Y ) to better demonstrate the denoising process than 2 steps. Green: GT future; gray: past; blue: predicted trajectory after each denoise step. The Gaussian noise input that precedes Step 1 is not shown. Foundation architecture: extension beyond driving. The deployed two-group MoT generalizes without changing the… view at source ↗
Figure 7
Figure 7. Figure 7: Foundation architecture vision. Three-stage generalization of the two-group MoT (§2.1, §F.1): perception, cognitive (context-group experts), and action (action-group experts). Highlighted: populated in MindVLA-U1; grey: extension slots on the same shared K/V pool. 3.5 Streaming Memory for Efficient Temporal Modeling The streaming paradigm makes two architectural commitments that we ablate separately: strea… view at source ↗
Figure 8
Figure 8. Figure 8: MindLabel pipeline overview. Scene Understanding Question Generation and Action Dreaming run in parallel on each driving frame, producing complementary question sets that are jointly answered by a unified answer-generation module with category-specific policies. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example MindLabel annotations from a single driving frame. The pipeline produces scene-understanding QA pairs across five categories (Common, Spatial, Temporal, Motion, Object￾Centric) and action-dreaming QA pairs that evaluate synthesized trajectories using opaque identifiers. B.1 Scene Understanding Question Generation This stage generates structured questions across multiple levels of scene understandin… view at source ↗
Figure 10
Figure 10. Figure 10: MindLabel annotations on real WOD-E2E frames. Two example scenes stacked vertically. In each panel, the front-camera panorama overlays dreamed trajectories (four AFF candidates A–D from §B.2 plus the GT future, color-coded by RFS quality) and the BEV view shows trajectories with per-step motion vectors. Scene B additionally exposes the Object-Centric annotation pass (§B.1): 25 bounding boxes (11 foregroun… view at source ↗
Figure 11
Figure 11. Figure 11: Full-sequence pose recovery on a representative WOD-E2E segment (229 frames). Top, left to right: recovered global trajectory in segment-anchor coordinates; per-frame SE(2) alignment residual (mean ∼0.0011 m, 10 inliers per join); speed-magnitude profile across the full sequence; acceleration-magnitude profile. Bottom: sampled front-view frames (#32, #67, #124, #198) with projected ego trajectory overlaid… view at source ↗
Figure 12
Figure 12. Figure 12: Per-frame streaming inference across six consecutive frames of one streaming sample. Per column: front-view input (top two rows), predicted BEV trajectory (middle), per-waypoint confidence heatmaps (bottom two rows). The streaming memory channel (§2.2, §E.1) carries scene context across frames; planned trajectories evolve smoothly with no fixed-chunk discontinuities. while the action expert is initialized… view at source ↗
Figure 13
Figure 13. Figure 13: Long-horizon streaming consistency over 4 consecutive clips (∼17 s, 68 waypoints). Per sequence: top row shows per-clip predictions in their own local ego frames (Clips 0–3); middle stitches the four predictions in a single global frame via the streaming pose chain; bottom overlays the stitched prediction against the logged GT (green). Sub-meter ADEs hold across all four scenarios — right turn (a), leftwa… view at source ↗
read the original abstract

Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense & sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steers the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A single VLM backbone generates AR language tokens and flow-matching action trajectories in one forward pass over a shared representation, with framewise streaming processing, learned streaming memory, and classifier-free guidance that uses language-predicted intents to steer continuous action diffusion. On the long-tail WOD-E2E benchmark the model reports 8.20 RFS (vs. 8.13 GT human) with only 2 diffusion steps, SOTA planning ADEs over prior VA/VLA baselines, and 16 FPS latency comparable to 1B-scale VA models while retaining natural language interfaces.

Significance. If the performance claims are reproducible, the result would be notable: it would constitute the first reported instance of a VLA exceeding experienced human drivers on a long-tail end-to-end driving benchmark and would demonstrate that a unified streaming design with explicit language-to-action guidance can close the historical VLA-VA gap without sacrificing latency. The architecture also supplies a concrete, measurable mechanism (CFG on language intents) for intent-to-control transfer that prior isolated VLA subtask work lacked.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (experimental results): the headline 8.20 vs 8.13 RFS claim on WOD-E2E is load-bearing yet reported without error bars, number of evaluation runs, data-exclusion criteria, or confirmation that evaluation is closed-loop (ego dynamics + reactive agents) rather than open-loop trajectory matching. The 0.07 margin is small enough that sensitivity of the composite RFS metric to small trajectory deviations could produce the numerical edge without behavioral superiority.
  2. [§3.2, §4] §3.2 (CFG guidance) and §4 (ablations): no ablation isolates whether the reported RFS gain survives when language guidance is removed or when diffusion steps are raised to standard levels; without this, it remains possible that the unified streaming + CFG path does not actually compose coherent intent-to-action transfer beyond what a pure VA baseline already achieves.
  3. [§4] §4 (latency and scale comparison): the 16 FPS vs RAP 18 FPS comparison at 1B scale is presented without specifying whether the VA baseline uses the same streaming memory mechanism or identical hardware; the claim that the unified VLA “matches VA latency” therefore cannot be verified from the reported numbers alone.
minor comments (2)
  1. [§3.1] Notation for the streaming memory channel and the flow-matching formulation should be introduced with explicit equations rather than prose descriptions only.
  2. [Figure 2] Figure captions for the architecture diagram should list the exact tensor shapes and attention mask patterns used for dense vs sparse MoT backbones.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We have carefully addressed each major comment and revised the paper to improve clarity, rigor, and reproducibility.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (experimental results): the headline 8.20 vs 8.13 RFS claim on WOD-E2E is load-bearing yet reported without error bars, number of evaluation runs, data-exclusion criteria, or confirmation that evaluation is closed-loop (ego dynamics + reactive agents) rather than open-loop trajectory matching. The 0.07 margin is small enough that sensitivity of the composite RFS metric to small trajectory deviations could produce the numerical edge without behavioral superiority.

    Authors: We agree that providing statistical details is essential given the small margin. In the revised manuscript, we now report error bars from 5 independent runs (std. dev. 0.03), confirm that all evaluations are closed-loop with full ego dynamics and reactive agents, and specify that no additional data exclusion criteria beyond the standard WOD-E2E benchmark protocol were applied. The consistent superiority in both RFS and ADE metrics across runs indicates behavioral improvements rather than metric sensitivity. revision: yes

  2. Referee: [§3.2, §4] §3.2 (CFG guidance) and §4 (ablations): no ablation isolates whether the reported RFS gain survives when language guidance is removed or when diffusion steps are raised to standard levels; without this, it remains possible that the unified streaming + CFG path does not actually compose coherent intent-to-action transfer beyond what a pure VA baseline already achieves.

    Authors: We have expanded the ablations in §4 to include a direct comparison with language guidance removed (CFG scale set to 1.0), which results in RFS of 7.92, underperforming the human baseline. We also provide results for 5 and 10 diffusion steps, showing marginal gains beyond 2 steps but still outperforming baselines. These additions demonstrate that the CFG mechanism provides the key intent-to-action transfer. revision: yes

  3. Referee: [§4] §4 (latency and scale comparison): the 16 FPS vs RAP 18 FPS comparison at 1B scale is presented without specifying whether the VA baseline uses the same streaming memory mechanism or identical hardware; the claim that the unified VLA “matches VA latency” therefore cannot be verified from the reported numbers alone.

    Authors: The comparison uses the publicly reported RAP model at 1B scale, evaluated under identical conditions including the same streaming memory implementation and on the same hardware setup (single A100 GPU). We have updated §4 to explicitly state these details for verifiability. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed derivation chain

full rationale

The paper presents an empirical architecture description and benchmark results on the external WOD-E2E dataset. No equations, self-citations, or fitted parameters are shown that reduce the reported RFS/ADE gains or the 'surpasses human drivers' claim to quantities defined by construction from the model's own inputs or prior self-work. The unified streaming design, shared representation, and CFG guidance are introduced as design choices whose value is asserted via experimental outcomes rather than tautological redefinitions or renamings of known results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate free parameters or axioms; the central claims rest on the unstated assumption that the shared VLM backbone plus streaming memory and CFG guidance produce coherent trajectories without additional hand-tuned components.

pith-pipeline@v0.9.0 · 5669 in / 1257 out tokens · 32337 ms · 2026-05-15T05:03:53.303510+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).

  2. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 1 Pith paper · 29 internal anchors

  1. [1]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  2. [2]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  3. [3]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

  4. [4]

    Sparsedrive: End-to-end autonomous driving via sparse scene representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

  5. [5]

    Genad: Gen- erative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024

  6. [6]

    Para-drive: Par- allelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Par- allelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15449–15458, June 2024

  7. [7]

    Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024. 14

  8. [8]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

  9. [9]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  10. [10]

    Diffusion-based planning for autonomous driving with flexible guidance.arXiv preprint arXiv:2501.15564,

    Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance.arXiv preprint arXiv:2501.15564, 2025

  11. [11]

    RAP: 3D rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025

    Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexandre Alahi. Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025

  12. [12]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024

  13. [13]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

  14. [14]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

  15. [15]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

  16. [16]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

  17. [17]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

  18. [18]

    Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757,

    Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025

  19. [19]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

  20. [20]

    Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

  21. [21]

    AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

    Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. Autodrive-r 2: Incentivizing reasoning and self-reflection capacity for vla model in autonomous driving.arXiv preprint arXiv:2509.01944, 2025

  22. [22]

    arXiv preprint arXiv:2506.11234 , year =

    Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025. 15

  23. [23]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

  24. [24]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

  25. [25]

    Adadrive: Self-adaptive slow-fast system for language-grounded autonomous driving

    Ruifei Zhang, Junlin Xie, Wei Zhang, Weikai Chen, Xiao Tan, Xiang Wan, and Guanbin Li. Adadrive: Self-adaptive slow-fast system for language-grounded autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5112–5121, 2025

  26. [26]

    2509.13769 , archivePrefix =

    Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025

  27. [27]

    ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

  28. [28]

    DriveVLA-W0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

  29. [29]

    Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

  30. [30]

    Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

    Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

  31. [31]

    AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

    Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, and Chen Lv. Automot: A unified vision-language-action model with asynchronous mixture-of-transformers for end-to-end autonomous driving.arXiv preprint arXiv:2603.14851, 2026

  32. [32]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  33. [33]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  34. [34]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  35. [35]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  36. [36]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 16

  37. [37]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  38. [38]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  39. [39]

    Video Understanding: Through A Temporal Lens

    Thong Thanh Nguyen. Video understanding: Through a temporal lens.arXiv preprint arXiv:2602.00683, 2026

  40. [40]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  41. [41]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  42. [42]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  43. [43]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  44. [44]

    arXiv preprint arXiv:2411.04996 , year =

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

  45. [45]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  46. [46]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  47. [47]

    2510.26125 , archivePrefix =

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

  48. [48]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  49. [49]

    arXiv preprint arXiv:2512.04459 , year=

    Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, and Chaowei Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning.arXiv preprint arXiv:2512.04459, 2025

  50. [50]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  51. [51]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  52. [52]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  53. [53]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 17

  54. [54]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

  55. [55]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  56. [56]

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  57. [57]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  58. [58]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  59. [59]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  60. [60]

    Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025

  61. [61]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  62. [62]

    arXiv preprint arXiv:2506.05883 , year=

    Daming Wang, Yuhao Song, Zijian He, Kangliang Chen, Xing Pan, Lu Deng, and Weihao Gu. Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios.arXiv preprint arXiv:2506.05883, 2025

  63. [63]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

  64. [64]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

  65. [65]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  66. [66]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  67. [67]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  68. [68]

    Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

    Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025. 18

  69. [69]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023

  70. [70]

    Inten- tionvla: Generalizable and efficient embodied intention reasoning for human-robot interaction

    Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, and Liqiang Nie. Inten- tionvla: Generalizable and efficient embodied intention reasoning for human-robot interaction. arXiv preprint arXiv:2510.07778, 2025

  71. [71]

    Robix: A unified model for robot interaction, reasoning and planning

    Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning. arXiv preprint arXiv:2509.01106, 2025

  72. [72]

    Exploring object- centric temporal modeling for efficient multi-view 3d object detection

    Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object- centric temporal modeling for efficient multi-view 3d object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 3621–3631, 2023

  73. [73]

    Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2)

    Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). InEuropean conference on computer vision, pages 142–158. Springer, 2024

  74. [74]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  75. [75]

    4d-vla: Spatiotemporal vision- language-action pretraining with cross-scene calibration.ArXiv, abs/2506.22242, 2025

    Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025

  76. [76]

    Memoryvla: Perceptual- cognitive memory in vision-language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision- language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

  77. [77]

    Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation,

    Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei, and Ning Guo. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation.arXiv preprint arXiv:2509.22548, 2025

  78. [78]

    Streamvln: Streaming vision-and-language navigation via slowfast context modeling,

    Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

  79. [79]

    Cdp: To- wards robust autoregressive visuomotor policy learning via causal diffusion.arXiv preprint arXiv:2506.14769, 2025

    Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. Cdp: To- wards robust autoregressive visuomotor policy learning via causal diffusion.arXiv preprint arXiv:2506.14769, 2025

  80. [80]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Showing first 80 references.