Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 00:24 UTC · model grok-4.3
The pith
Merging the parameter difference between two models trained on a small task set into pretrained weights lets standard finetuning match auxiliary-objective methods' performance at lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training the model to converge on a small-scale task set using two distinct training strategies, the difference between the resulting model parameters can be interpreted as capability vectors provided by auxiliary tasks. These vectors are merged with pretrained parameters to form a capability-enhanced meta model. When standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead.
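The parameter-space construction in this claim can be sketched in a few lines, with dicts of numpy arrays standing in for model state dicts. All names here (theta_pre, theta_sft, theta_aux, the merge scale `lam`) and the toy numbers are illustrative, not the paper's code or notation:

```python
# Sketch of the capability-vector construction: two runs from the same
# pretrained weights, their difference merged back into the pretrained model.
import numpy as np

def capability_vector(theta_aux, theta_sft):
    """Difference between auxiliary-trained and standard-SFT parameters."""
    return {k: theta_aux[k] - theta_sft[k] for k in theta_sft}

def merge(theta_pre, delta, lam=1.0):
    """Add the scaled capability vector to the pretrained weights."""
    return {k: theta_pre[k] + lam * delta[k] for k in theta_pre}

# Two "training runs" that start from the same pretrained weights:
theta_pre = {"w": np.zeros(3)}
theta_sft = {"w": np.array([1.0, 0.0, 0.0])}  # task-specific fit only
theta_aux = {"w": np.array([1.0, 0.5, 0.5])}  # same fit + auxiliary gains

delta = capability_vector(theta_aux, theta_sft)
theta_meta = merge(theta_pre, delta)
print(theta_meta["w"])  # → [0.  0.5 0.5]
```

In this toy example the task-specific component (the first coordinate) is identical across the two runs and cancels in the difference, which is precisely the decoupling the claim assumes; whether it cancels in real non-convex training is a separate empirical question.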
What carries the argument
Capability vectors formed from the parameter difference between two training strategies on a small task set, merged into pretrained weights and paired with an orthogonal regularization loss during standard SFT.
Load-bearing premise
The difference between model parameters obtained from two distinct training strategies on a small-scale task set can be interpreted as capability vectors provided by auxiliary tasks.
What would settle it
An experiment in which the merged model plus orthogonal regularization fails to reach performance levels of auxiliary finetuned baselines on held-out robot tasks would disprove the central claim.
Original abstract
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary tasks. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: https://chris1220313648.github.io/Fast-dVLA/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Fast-dVLA to accelerate discrete diffusion VLAs to real-time performance. It decouples auxiliary training objectives in parameter space by training a model to convergence on a small-scale task set under standard SFT and auxiliary strategies, interpreting the resulting parameter difference Δθ as capability vectors supplied by auxiliary tasks. These vectors are merged with pretrained parameters to create an enhanced meta-model. Standard SFT is then augmented with a lightweight orthogonal regularization loss so that the merged model matches the performance of auxiliary-finetuned baselines while incurring lower computational overhead. The abstract states that experimental results demonstrate effectiveness across diverse robot tasks.
Significance. If the parameter-difference construction reliably isolates transferable auxiliary capabilities, the method would offer a practical route to auxiliary-level gains without auxiliary-loss overhead during main training. This could meaningfully lower adaptation costs for discrete diffusion VLAs in robotics, where real-time constraints are binding. The approach is conceptually lightweight and avoids introducing new auxiliary objectives at inference or fine-tuning time.
major comments (2)
- [Method (parameter-space decoupling and merging)] The load-bearing step (described in the method section on decoupling) treats Δθ = θ_aux − θ_SFT as a clean, additive capability vector. In non-convex diffusion training landscapes this difference can mix auxiliary-task gains with optimization artifacts (learning-rate schedules, stochasticity, early stopping). No analysis is supplied showing that the vector remains orthogonal to task-specific fitting or transfers reliably; without such evidence the subsequent merging and orthogonal-regularization claims rest on an unverified assumption.
- [Experimental results] The abstract asserts that the merged model attains performance comparable to auxiliary baselines, yet the provided text supplies no quantitative metrics, baselines, ablation tables, or statistical tests. The central claim therefore cannot be evaluated from the manuscript as written; the experimental section must include these details for the performance-parity result to be verifiable.
minor comments (2)
- [Method] Notation for the capability vector and the orthogonal regularization loss should be introduced with explicit equations rather than prose descriptions alone.
- [Experimental setup] The project page link is given but no reproducibility artifacts (code, checkpoints, or exact hyper-parameters for the two training runs) are referenced in the text.
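One plausible way to write the equations the first minor comment asks for, assuming an additive merge and a cosine-style orthogonality penalty. The symbols λ, β, ε and the exact form of the penalty are illustrative guesses, not taken from the paper:

```latex
% Illustrative formalization; symbols are not the paper's own notation.
% Capability vector and additive merge:
\Delta\theta = \theta_{\mathrm{aux}} - \theta_{\mathrm{SFT}}, \qquad
\theta_{\mathrm{meta}} = \theta_{\mathrm{pre}} + \lambda\,\Delta\theta
% Regularized SFT objective, penalizing the cosine between the SFT update
% (\theta - \theta_{\mathrm{meta}}) and the capability directions:
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{SFT}}(\theta)
  + \beta \left(
      \frac{\langle \theta - \theta_{\mathrm{meta}},\, \Delta\theta \rangle}
           {\lVert \theta - \theta_{\mathrm{meta}} \rVert\, \lVert \Delta\theta \rVert + \epsilon}
    \right)^{2}
```

Here λ scales the merged capability vector, β weights the penalty, and ε guards against division by zero early in SFT when the update is near zero; the paper may use a different orthogonality measure.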
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
-
Referee: [Method (parameter-space decoupling and merging)] The load-bearing step (described in the method section on decoupling) treats Δθ = θ_aux − θ_SFT as a clean, additive capability vector. In non-convex diffusion training landscapes this difference can mix auxiliary-task gains with optimization artifacts (learning-rate schedules, stochasticity, early stopping). No analysis is supplied showing that the vector remains orthogonal to task-specific fitting or transfers reliably; without such evidence the subsequent merging and orthogonal-regularization claims rest on an unverified assumption.
Authors: We acknowledge that Δθ computed from two separate training runs on a small task set can in principle contain optimization artifacts in a non-convex landscape. However, the orthogonal regularization term we introduce during the subsequent SFT stage is explicitly designed to encourage the merged parameters to preserve the auxiliary-derived directions while fitting the target task. Our preliminary experiments indicate that the performance gains persist across different random seeds and learning-rate schedules, suggesting the dominant component of Δθ is transferable. In the revision we will add a dedicated analysis subsection that reports (i) cosine similarities between Δθ and per-task gradients, (ii) ablation results on multiple training seeds, and (iii) sensitivity to early-stopping criteria. These additions will provide direct evidence for the reliability of the capability-vector interpretation. revision: yes
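The first diagnostic the authors promise, cosine similarity between Δθ and per-task gradients, is straightforward to sketch. The flattening convention and the toy vectors below are assumptions for illustration, not the authors' analysis code:

```python
# Cosine similarity between the flattened capability vector and a
# flattened per-task gradient (toy values).
import numpy as np

def flatten(params):
    """Concatenate a dict of arrays into one vector, in sorted-key order."""
    return np.concatenate([params[k].ravel() for k in sorted(params)])

def cosine(u, v, eps=1e-12):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

delta = {"w": np.array([0.0, 0.5, 0.5])}  # capability vector
grad = {"w": np.array([1.0, 0.0, 0.0])}   # per-task gradient

sim = cosine(flatten(delta), flatten(grad))
print(sim)  # → 0.0 (orthogonal: no task-specific overlap)
```

A similarity near zero would support the claim that Δθ carries little task-specific signal; values far from zero would indicate the contamination the referee worries about.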
-
Referee: [Experimental results] The abstract asserts that the merged model attains performance comparable to auxiliary baselines, yet the provided text supplies no quantitative metrics, baselines, ablation tables, or statistical tests. The central claim therefore cannot be evaluated from the manuscript as written; the experimental section must include these details for the performance-parity result to be verifiable.
Authors: We agree that the current manuscript text does not contain the quantitative tables, baselines, or statistical tests needed to substantiate the performance-parity claim. In the revised version we will expand the experimental section with (i) success-rate and convergence-step tables comparing the merged model against standard SFT and full auxiliary-objective baselines on all reported robot tasks, (ii) ablation tables isolating the contribution of the orthogonal regularization term, and (iii) mean and standard-deviation results over at least five independent runs with paired statistical significance tests. These additions will make the central empirical claims fully verifiable. revision: yes
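The paired significance tests promised in (iii) could take the form of a sign-flip permutation test on per-seed success-rate differences (merged model minus baseline). The helper name and the eight synthetic per-seed differences below are illustrative only:

```python
# Sign-flip permutation test: under the null of no difference, each
# per-seed difference is equally likely to have either sign.
import numpy as np

def paired_permutation_test(diffs, n_perm=10000, seed=0):
    """Two-sided p-value for mean(diffs) != 0 under random sign flips."""
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    return float((perm_means >= observed).mean())

# Synthetic per-seed success-rate differences over eight runs:
diffs = np.array([0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.02, 0.01])
p = paired_permutation_test(diffs)
print(f"p = {p:.4f}")  # small p: consistent per-seed gains are unlikely under chance
```

A paired test of this kind needs no distributional assumptions, which suits small numbers of expensive robot-learning runs.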
Circularity Check
Parameter difference interpreted as capability vector by construction
specific steps
- self-definitional [Abstract]
"The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model."
The paper defines the observed parameter difference Δθ obtained from two training runs as the 'capability vectors provided by auxiliary tasks' by direct interpretation, then proceeds to merge these vectors and claim enhanced general capabilities. This makes the central decoupling step self-referential: the benefit is attributed to the vector precisely because it was labeled as such, without an independent derivation showing that the difference cleanly encodes auxiliary gains orthogonal to task-specific fitting.
full rationale
The paper's core construction trains two models on the same small-scale task set under distinct strategies, then directly interprets their parameter difference as auxiliary capability vectors that are merged into a meta-model. This interpretive step is self-definitional in the paper's own wording, but it remains a labeling assumption rather than a load-bearing tautology that would force the downstream performance claims to hold by construction. No equations reduce to their inputs by construction, no self-citations are invoked as uniqueness theorems, and the final performance parity is presented as an empirical outcome of the orthogonal regularization loss rather than a fitted prediction. The derivation is therefore largely self-contained and checkable against external benchmarks, warranting only a low circularity score.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The difference between parameters from two distinct training strategies on a small task set represents the capability enhancement supplied by auxiliary tasks.
invented entities (1)
- capability vectors: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
decouple the two objectives of auxiliary task training within the parameter space... difference between the resulting model parameters can then be interpreted as capability vectors
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
block-wise diffusion with a corresponding attention pattern to allow KV cache reuse
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
Reference graph
Works this paper leans on
-
[1]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573,
-
[2]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,
-
[4]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539,
-
[5]
Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified Diffusion VLA: Vision-language-action model via joint discrete denoising diffusion process. arXiv preprint arXiv:2511.01718,
-
[6]
arXiv preprint arXiv:2505.03912 (2025)
Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, et al. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation.arXiv preprint arXiv:2505.03912,
-
[7]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019
-
[8]
Thinkact: Vision- language-action reasoning via reinforced visual latent planning, 2025
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815,
-
[9]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054,
-
[10]
arXiv preprint arXiv:2509.12594 (2025)
Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, and Xianpeng Lang. The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning.arXiv preprint arXiv:2509.12594,
-
[11]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025a. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language...
-
[12]
Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816,
-
[13]
Causal World Modeling for Robot Control
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998,
-
[14]
Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,
-
[15]
Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975, 2025a
Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu, Yuhao Zhang, Zanxin Chen, Tianshuo Yang, Yilun Chen, Jiangmiao Pang, et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975, 2025a. Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jian...
-
[16]
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025b. Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for...
-
[17]
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692,
-
[18]
Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, and Chang Xu. Action-aware dynamic pruning for efficient vision-language-action manipulation.arXiv preprint arXiv:2509.22093,
-
[19]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,
-
[20]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830,
-
[21]
Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. In The Thirteenth International Conference on Learning Representations, 2025a. Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing ge...
-
[22]
arXiv preprint arXiv:2506.13725 (2025)
Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, and Haoang Li. Ceed-vla: Consistency vision-language-action model with early-exit decoding.arXiv preprint arXiv:2506.13725, 2025a. Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. Accelerating vision-language-action ...
-
[23]
https://openreview.net/forum?id=t5uLZSRjhF. Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025b. Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, and Yi Xu. dvla: Di...
-
[24]
Yuhao Xu, Yantai Yang, Zhenyang Fan, Yufan Liu, Yuming Li, Bing Li, and Zhipeng Zhang. Qvla: Not all channels are equal in vision-language-action model’s quantization.arXiv preprint arXiv:2602.03782,
-
[25]
Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, et al. S-vam: Shortcut video-action model by self-distilling geometric and semantic foresight.arXiv preprint arXiv:2603.16195,
-
[26]
Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. EfficientVLA: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100,
-
[27]
arXiv preprint arXiv:2512.22615 (2025)
Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, and Lingpeng Kong. Dream-VL & Dream-VLA: Open vision-language and vision-language-action models with diffusion language model backbone. arXiv preprint arXiv:2512.22615, 2025a. Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Ko...
-
[28]
Runpeng Yu, Qi Li, and Xinchao Wang. Discrete diffusion in large language and multimodal models: A survey.arXiv preprint arXiv:2506.13759,
-
[29]
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a. Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, J...
-
[30]
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Wenxuan Song, Jiayi Chen, and Haoang Li. Flowvla: Thinking in motion with a visual chain of thought.arXiv preprint arXiv:2508.18269,
-
[31]
Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, et al. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models.arXiv preprint arXiv:2603.22280,
-
[32]
[Appendix fragment, Table S3: comparison with state-of-the-art on LIBERO. An unattributed row reads 96.8% / 98.8% / 95.8% / 85.2% / 94.2%; π0.5 (Intelligence et al., 2025): 98.8% / 98.2% / 98.0% / 92.4% / 96.8%; DDVLA (Liang et al., 2025b): 97.2% / 98.6% / 97.4% / 92.0% / 96.3%; + ours: 97.0% / 98.8% / 97.6% / 92.8% / 96.6%. "Overall, our method achieves competitive performance a..."]
discussion (0)