MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
Pith reviewed 2026-05-14 20:51 UTC · model grok-4.3
The pith
A unified streaming VLA model with shared language-action backbone surpasses VA methods and experienced human drivers on long-tail autonomous driving benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MindVLA-U1 is the first unified streaming VLA architecture for autonomous driving: a single shared VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in one forward pass; a streaming framewise process with a learned memory channel maintains temporal context; and language-predicted intent steers action generation through classifier-free guidance. The result is 8.20 RFS on WOD-E2E, exceeding experienced human drivers at 8.13, while running at 16 FPS.
What carries the argument
The unified VLM backbone that produces both autoregressive language tokens and flow-matching action trajectories in a single forward pass over one shared representation, together with streaming framewise processing and a learned memory channel for temporal context.
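The framewise loop with a carried memory state can be sketched in a few lines. Everything here (feature size, update rule, the stand-in fusion and trajectory heads) is an illustrative assumption, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone_step(frame_feat, memory):
    """One framewise forward pass: the backbone sees the current frame plus a
    compact memory state instead of a multi-frame clip. All internals are
    illustrative stand-ins, not the paper's actual layers."""
    fused = np.tanh(frame_feat + memory)      # stand-in for shared VLM attention
    action = fused.mean()                     # stand-in for the trajectory head
    new_memory = 0.9 * memory + 0.1 * fused   # memory channel carries context forward
    return action, new_memory

memory = np.zeros(8)                          # learned memory state (size assumed)
actions = []
for t in range(5):                            # streaming: one frame at a time
    frame_feat = rng.normal(size=8)           # per-frame visual features (random here)
    action, memory = backbone_step(frame_feat, memory)
    actions.append(action)
```

Because the memory update is a slow exponential blend, consecutive actions change smoothly even though each step only ever touches one frame.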
If this is right
- Planning ADEs improve by large margins over prior VA and VLA methods while using only two diffusion steps.
- Natural-language interfaces become usable for steering continuous trajectories without dropping below VA-class throughput.
- Flexible self-attention context management on MoT backbones enables fast or slow execution modes within the same model.
- Planned trajectories evolve smoothly across frames because the memory channel carries context without redundant multi-frame modeling.
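The language-to-action route with only two integration steps reduces to a short guided-sampling loop. This is a generic flow-matching/CFG sketch with toy velocity fields (the waypoint, guidance weight, and step size are invented for illustration), not the paper's trained heads:

```python
import numpy as np

def cfg_flow_step(x, v_cond, v_uncond, w, dt):
    """One Euler step of flow matching under classifier-free guidance:
    the guided velocity is v_uncond + w * (v_cond - v_uncond)."""
    v = v_uncond + w * (v_cond - v_uncond)
    return x + dt * v

# Toy velocity fields standing in for the action head's two predictions:
# conditioned on a language-predicted intent ("reach this waypoint") versus
# the null-intent prediction. Neither matches the paper's trained model.
waypoint = np.array([1.0, 0.5])

def v_conditioned(x):
    return waypoint - x          # pulls the trajectory toward the intent

def v_unconditioned(x):
    return -x                    # drifts back toward the origin

x = np.zeros(2)                  # zero-initialized trajectory point
for _ in range(2):               # two integration steps, matching the reported setup
    x = cfg_flow_step(x, v_conditioned(x), v_unconditioned(x), w=2.0, dt=0.5)
```

Raising `w` above 1 amplifies the intent-conditioned direction, which is what makes the language prediction act as a control signal on the continuous trajectory.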
Where Pith is reading between the lines
- Real-time systems could accept high-level language commands as direct control signals rather than post-processing outputs from separate planners.
- Removing the boundary between language reasoning and low-level control may reduce error accumulation that occurs when separate modules hand off information.
- The same architecture could be tested on embodied tasks outside driving to see whether unified streaming plus intent guidance generalizes beyond vehicle motion.
Load-bearing premise
The WOD-E2E benchmark and its RFS metric accurately reflect real-world driving safety, and the reported gains come primarily from the unified streaming architecture rather than from benchmark-specific tuning.
What would settle it
An ablation on WOD-E2E or another driving dataset in which removing the unified backbone, the streaming memory channel, or the language-to-action CFG path eliminates the performance margin over strong VA baselines.
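Such an ablation reduces to a toggle grid over the three mechanisms with data, schedule, and post-processing held fixed. A schematic sketch with a stand-in evaluator (the component names and per-component gains are invented for illustration):

```python
from itertools import product

# Hypothetical ablation grid: toggle each mechanism while holding everything
# else fixed. Names and numbers are illustrative, not the paper's.
components = ["unified_backbone", "memory_channel", "language_cfg"]

def run_eval(config):
    """Stand-in for training + WOD-E2E evaluation under one config.
    Toy scoring: each enabled component contributes a fixed RFS gain."""
    gains = {"unified_backbone": 0.30, "memory_channel": 0.20, "language_cfg": 0.10}
    return 7.6 + sum(g for c, g in gains.items() if config[c])

results = {}
for flags in product([True, False], repeat=3):
    config = dict(zip(components, flags))
    results[flags] = run_eval(config)

full = results[(True, True, True)]
for c_idx, name in enumerate(components):
    flags = tuple(i != c_idx for i in range(3))   # disable exactly one component
    print(f"drop {name}: delta RFS = {results[flags] - full:+.2f}")
```

If the margin over VA baselines vanishes in any of these rows, the corresponding mechanism is causal for the headline result.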
Original abstract
Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose into coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A streaming design processes the driving video framewise rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture admits fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones via flexible self-attention context management, and exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO's 18 FPS) while preserving natural-language interfaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. It uses a single VLM backbone to generate autoregressive language tokens and flow-matching continuous action trajectories in one forward pass, incorporates a streaming design with a learned memory channel for temporal context across frames, supports fast/slow execution via MoT backbones, and employs language-predicted intent with classifier-free guidance to steer action diffusion. On the WOD-E2E benchmark, it reports surpassing experienced human drivers (8.20 RFS vs. 8.13 GT RFS) with only 2 diffusion steps, state-of-the-art planning ADEs over prior VA/VLA methods, and competitive throughput (16 FPS).
Significance. If the reported margins hold under rigorous verification, the result would represent a notable advance by demonstrating that a unified VLA can exceed human performance on a long-tail end-to-end driving benchmark while preserving language interfaces and matching VA-class speed. The streaming memory and language-to-action CFG mechanisms, if isolated as causal, could influence future multimodal control architectures.
major comments (3)
- [Abstract / Results] The central claim that MindVLA-U1 surpasses humans (8.20 RFS vs. 8.13 GT RFS) on WOD-E2E rests on the RFS metric being a calibrated, safety-predictive scalar whose 0.07 margin is robust; the manuscript provides no validation of RFS against real-world safety outcomes, no sensitivity analysis of evaluation choices, and no confirmation that the GT human trajectories were collected under equivalent perception noise.
- [Methods / Experiments] No ablation studies isolate the contribution of the learned memory channel or the language-to-action CFG route from confounding factors such as data curation, optimizer schedule, or post-processing; without these, the attribution of gains to the unified streaming architecture remains unverified.
- [Experiments] The abstract states large performance margins and SOTA ADEs over prior VA/VLA methods, yet supplies no details on baseline implementations, data splits, statistical significance testing, or variance across runs, preventing verification of the reported superiority.
minor comments (2)
- [Abstract] The abstract claims 'state-of-the-art planning ADEs ... by large margins' without quoting the specific ADE values or naming the exact prior methods compared in the main results table.
- [Methods] Notation for the streaming memory channel and MoT context management is introduced without an accompanying diagram or pseudocode, making the flexible self-attention mechanism difficult to reconstruct.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our claims. We will make revisions to address the major comments, including adding ablation studies, experimental details, and sensitivity analyses where feasible. Our responses to each major comment are provided below.
Point-by-point responses
- Referee: [Abstract / Results] The central claim that MindVLA-U1 surpasses humans (8.20 RFS vs. 8.13 GT RFS) on WOD-E2E rests on the RFS metric being a calibrated, safety-predictive scalar whose 0.07 margin is robust; the manuscript provides no validation of RFS against real-world safety outcomes, no sensitivity analysis of evaluation choices, and no confirmation that the GT human trajectories were collected under equivalent perception noise.
Authors: The RFS metric is the official evaluation metric of the WOD-E2E benchmark, and our results follow the standard protocol provided by the benchmark organizers. The ground-truth (GT) human trajectories are from the same dataset and thus collected under the same perception conditions as the evaluation setup. We agree that a sensitivity analysis would be beneficial. In the revised manuscript, we will add a sensitivity analysis section discussing the robustness of the 0.07 margin to minor evaluation variations and include references to prior work validating RFS as a safety proxy. However, comprehensive real-world safety outcome validation is outside the scope of this benchmark study. revision: partial
- Referee: [Methods / Experiments] No ablation studies isolate the contribution of the learned memory channel or the language-to-action CFG route from confounding factors such as data curation, optimizer schedule, or post-processing; without these, the attribution of gains to the unified streaming architecture remains unverified.
Authors: We recognize the importance of isolating the contributions of the learned memory channel and the language-to-action CFG. In the revised manuscript, we will include additional ablation experiments that systematically disable or modify these components while keeping other factors (such as data, training schedule, and post-processing) constant. This will provide clearer evidence for their roles in the performance gains. revision: yes
- Referee: [Experiments] The abstract states large performance margins and SOTA ADEs over prior VA/VLA methods, yet supplies no details on baseline implementations, data splits, statistical significance testing, or variance across runs, preventing verification of the reported superiority.
Authors: We will revise the experiments section to provide comprehensive details on baseline re-implementations, the precise data splits used from WOD-E2E, results from statistical significance tests (including p-values), and standard deviations or variances across multiple training runs with different random seeds. This additional information will facilitate independent verification of our reported results. revision: yes
declined revisions (1)
- Comprehensive validation of the RFS metric against real-world safety outcomes, which would require large-scale real-world testing and data not available within the current benchmark evaluation framework.
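The sensitivity and significance analyses promised in the responses above could take the form of a paired bootstrap over per-scenario RFS differences. A minimal sketch on synthetic scores (the real per-scenario distributions are not public; means and spread are assumed for illustration):

```python
import numpy as np

def bootstrap_margin_ci(model_scores, human_scores, n_boot=10_000, seed=0):
    """Paired bootstrap 95% CI for the mean per-scenario RFS margin."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(model_scores) - np.asarray(human_scores)
    n = len(diffs)
    boot_means = np.array(
        [diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    )
    return np.percentile(boot_means, [2.5, 97.5])

# Synthetic per-scenario scores with means near the reported 8.20 vs. 8.13.
rng = np.random.default_rng(1)
model = rng.normal(8.20, 0.8, size=400)
human = rng.normal(8.13, 0.8, size=400)
lo, hi = bootstrap_margin_ci(model, human)
print(f"95% CI for the RFS margin: [{lo:.3f}, {hi:.3f}]")
```

If the lower bound of such an interval crosses zero on the real per-scenario data, the 0.07 headline margin would not be statistically distinguishable from parity with the human GT.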
Circularity Check
No significant circularity; the empirical benchmark results are independent of the paper's internal definitions.
full rationale
The paper reports empirical performance gains on the external WOD-E2E benchmark using RFS and ADE metrics against human GT and prior methods. No equations, derivations, or first-principles predictions are presented that reduce reported scores to quantities defined by the authors' own fitted parameters or self-citations. The architecture description (unified VLM, streaming memory, CFG) is presented as an engineering choice whose value is measured externally rather than derived tautologically from the metrics themselves. Self-citations, if present, are not load-bearing for the central numerical claims.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of diffusion steps
axioms (1)
- domain assumption: The WOD-E2E benchmark and RFS metric constitute a faithful measure of autonomous driving quality.
invented entities (1)
- learned memory channel (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023
2023
-
[2]
Vad: Vectorized scene representation for efficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023
2023
-
[3]
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Sparsedrive: End-to-end autonomous driving via sparse scene representation
Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025
2025
-
[5]
Genad: Gen- erative end-to-end autonomous driving
Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024
2024
-
[6]
Para-drive: Par- allelized architecture for real-time autonomous driving
Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Par- allelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15449–15458, June 2024
2024
-
[7]
Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024
Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024. 14
2024
-
[8]
Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022
Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022
2022
-
[9]
Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving
Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025
2025
-
[10]
arXiv preprint arXiv:2501.15564 , year =
Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance.arXiv preprint arXiv:2501.15564, 2025
-
[11]
Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025
Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexandre Alahi. Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025
-
[12]
Lmdrive: Closed-loop end-to-end driving with large language models
Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024
2024
-
[13]
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024
-
[15]
Drivelm: Driving with graph visual question answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024
2024
-
[16]
Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024
Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024
-
[17]
Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025
2025
-
[18]
Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025
-
[19]
Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025
work page internal anchor Pith review arXiv 2025
-
[20]
Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025
-
[21]
Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. Autodrive-r 2: Incentivizing reasoning and self-reflection capacity for vla model in autonomous driving.arXiv preprint arXiv:2509.01944, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
arXiv preprint arXiv:2506.11234 (2025) 2, 8, 10
Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025. 15
-
[23]
Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025
2025
-
[24]
Simlingo: Vision-only closed-loop autonomous driving with language-action alignment
Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025
2025
-
[25]
Adadrive: Self-adaptive slow-fast system for language-grounded autonomous driving
Ruifei Zhang, Junlin Xie, Wei Zhang, Weikai Chen, Xiao Tan, Xiang Wan, and Guanbin Li. Adadrive: Self-adaptive slow-fast system for language-grounded autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5112–5121, 2025
2025
-
[26]
Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025
-
[27]
Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025
-
[28]
Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025
-
[29]
Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025
-
[30]
Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025
-
[31]
Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, and Chen Lv. Automot: A unified vision-language-action model with asynchronous mixture-of-transformers for end-to-end autonomous driving.arXiv preprint arXiv:2603.14851, 2026
-
[32]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024
2024
-
[33]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5
2026
-
[36]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 16
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Video Understanding: Through A Temporal Lens
Thong Thanh Nguyen. Video understanding: Through a temporal lens.arXiv preprint arXiv:2602.00683, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[40]
Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[41]
Scaling rectified flow trans- formers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
2024
-
[42]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[44]
arXiv preprint arXiv:2411.04996 , year =
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024
-
[45]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025
-
[48]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[49]
arXiv preprint arXiv:2512.04459 , year=
Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, and Chaowei Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning.arXiv preprint arXiv:2512.04459, 2025
-
[50]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
2022
-
[51]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023
2023
-
[52]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
2002
-
[53]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 17
2004
-
[54]
Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020
2020
-
[55]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[56]
Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021
2021
-
[58]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025
-
[61]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
arXiv preprint arXiv:2506.05883 , year=
Daming Wang, Yuhao Song, Zijian He, Kangliang Chen, Xing Pan, Lu Deng, and Weihao Gu. Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios.arXiv preprint arXiv:2506.05883, 2025
-
[63]
Carla: An open urban driving simulator
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on Robot Learning, pages 1–16. PMLR, 2017
-
[64]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
-
[65]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
-
[66]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024
-
[67]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
-
[68]
EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control. arXiv preprint arXiv:2508.21112, 2025
-
[69]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023
-
[70]
IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction
Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, and Liqiang Nie. Intentionvla: Generalizable and efficient embodied intention reasoning for human-robot interaction. arXiv preprint arXiv:2510.07778, 2025
-
[71]
Robix: A unified model for robot interaction, reasoning and planning
Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning. arXiv preprint arXiv:2509.01106, 2025
-
[72]
Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3621–3631, 2023
-
[73]
Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2)
Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). In European Conference on Computer Vision, pages 142–158. Springer, 2024
-
[74]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
-
[75]
4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration
Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration. arXiv preprint arXiv:2506.22242, 2025
-
[76]
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025
-
[77]
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei, and Ning Guo. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548, 2025
-
[78]
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv preprint arXiv:2507.05240, 2025
-
[79]
CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. Cdp: Towards robust autoregressive visuomotor policy learning via causal diffusion. arXiv preprint arXiv:2506.14769, 2025
-
[80]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020