InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang · 2025 · arXiv 2507.17520

17 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 3 · baseline 1

citation-polarity summary

background 3 · baseline 1

representative citing papers

TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

TTT-VLA performs test-time training for VLA models by optimizing only a latent prompt on new interaction data via a proxy self-supervised signal, yielding higher task success rates on SimplerEnv in single- and multi-embodiment settings.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with few inference steps on CALVIN, LIBERO, and real-robot tasks.

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

cs.RO · 2026-03-10 · unverdicted · novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

Continuous Reasoning for Vision-Language-Action

cs.RO · 2026-05-29 · unverdicted · novelty 6.0

Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

cs.CR · 2026-05-27 · unverdicted · novelty 6.0

ReasonBreak demonstrates up to 89% attack success on reasoning and 72% on trajectories in NVIDIA Alpamayo VLA models via black-box textual perturbations, introducing a reasoning-aware evaluation framework and benchmark for autonomous driving.

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

VLA-Hijack is a new adversarial patch attack on Vision-Language-Action models that suppresses real arm features and injects the patch as surrogate embodiment to achieve high cross-architecture transferability.

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

VL-DPO uses a VLM as a zero-shot reasoner to generate preference pairs from pretrained model rollouts, then finetunes via DPO on the Waymo Open End-to-End Driving Dataset, yielding 11.94% higher rater feedback score and 10.01% lower average displacement error.

Grounded World Model for Semantically Generalizable Planning

cs.RO · 2026-04-13 · conditional · novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

cs.AI · 2026-04-12 · unverdicted · novelty 6.0

EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

cs.CV · 2026-02-23 · unverdicted · novelty 6.0

Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

cs.RO · 2025-08-26 · conditional · novelty 6.0

MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.

QuoVLA: Quotient Space for Vision-Language-Action Models

cs.CV · 2026-05-24 · unverdicted · novelty 5.0

QuoVLA introduces a quotient-space framework that compresses VLM latents into action-sufficient representations via quantization and dual-branch design for better VLA generalization.

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

cs.RO · 2026-05-19 · unverdicted · novelty 5.0

RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

cs.RO · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

cs.RO · 2026-04-22 · unverdicted · novelty 5.0

PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.

Causal World Modeling for Robot Control

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

cs.RO · 2025-11-04 · unverdicted · novelty 5.0

XR-1 introduces Unified Vision-Motion Codes learned by dual-branch VQ-VAE and applies them in a three-stage training pipeline to outperform prior VLA models on 120+ real-world manipulation tasks across six robot embodiments.

citing papers explorer

Showing 17 of 17 citing papers.

TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models
cs.RO · 2026-06-02 · unverdicted · none · ref 37

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
cs.CV · 2026-04-27 · unverdicted · none · ref 49

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
cs.RO · 2026-03-10 · unverdicted · none · ref 44

Continuous Reasoning for Vision-Language-Action
cs.RO · 2026-05-29 · unverdicted · none · ref 29

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving
cs.CR · 2026-05-27 · unverdicted · none · ref 51

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking
cs.CV · 2026-05-27 · unverdicted · none · ref 31

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
cs.CV · 2026-05-19 · unverdicted · none · ref 12

Grounded World Model for Semantically Generalizable Planning
cs.RO · 2026-04-13 · conditional · none · ref 60

From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
cs.AI · 2026-04-12 · unverdicted · none · ref 22

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
cs.CV · 2026-02-23 · unverdicted · none · ref 45

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
cs.RO · 2025-08-26 · conditional · none · ref 28

QuoVLA: Quotient Space for Vision-Language-Action Models
cs.CV · 2026-05-24 · unverdicted · none · ref 30

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models
cs.RO · 2026-05-19 · unverdicted · none · ref 23

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO · 2026-05-07 · unverdicted · none · ref 30 · 2 links

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
cs.RO · 2026-04-22 · unverdicted · none · ref 28

Causal World Modeling for Robot Control
cs.CV · 2026-01-29 · unverdicted · none · ref 87

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
cs.RO · 2025-11-04 · unverdicted · none · ref 97
