Villa-x: Enhancing Latent Action Modeling in Vision-Language-Action Models. arXiv preprint arXiv:2507.23682, 2025.
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
- years: 2026 (16)
- verdicts: UNVERDICTED (16)
- roles: background (1)
- polarities: background (1)
citing papers explorer
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
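A minimal sketch of the rotational-latent idea, under assumptions: the class name, feature dimension, and regression loss below are hypothetical, and the paper's triplet-frame supervision and flow-matching objective are not reproduced. A latent action is predicted as a skew-symmetric generator whose matrix exponential is an SO(n) rotation acting on the current-frame embedding.

```python
import torch
import torch.nn as nn

class RotLatentAction(nn.Module):
    """Latent action as a continuous SO(n) rotation (illustrative only)."""
    def __init__(self, feat_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.feat_dim = feat_dim
        # predict the strictly lower-triangular entries of an n x n generator
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.GELU(),
            nn.Linear(hidden, feat_dim * (feat_dim - 1) // 2),
        )

    def forward(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        params = self.net(torch.cat([z_t, z_next], dim=-1))
        A = params.new_zeros(z_t.shape[0], self.feat_dim, self.feat_dim)
        rows, cols = torch.tril_indices(self.feat_dim, self.feat_dim, offset=-1)
        A[:, rows, cols] = params
        A = A - A.transpose(-1, -2)                 # skew-symmetric generator
        R = torch.matrix_exp(A)                     # exp of skew-symmetric lies in SO(n)
        return torch.einsum("bij,bj->bi", R, z_t)   # rotate the current-frame latent

# toy usage: the rotated current-frame latent should regress toward the next frame
model = RotLatentAction()
z_t, z_next = torch.randn(8, 64), torch.randn(8, 64)
loss = torch.nn.functional.mse_loss(model(z_t, z_next), z_next)
```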
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching the benefits of explicit world modeling while retaining VLA-level efficiency.
-
Learning Additively Compositional Latent Actions for Embodied AI
AC-LAM enforces additive composition on latent actions from visual transitions, yielding more structured and calibrated motion latents that improve downstream embodied policy learning over prior LAMs.
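A minimal sketch of an additive-composition regularizer, assuming a generic `encode(o_a, o_b)` latent-action encoder; the function names and loss form are illustrative, not AC-LAM's released code.

```python
import torch

def additive_composition_loss(encode, o_t, o_t1, o_t2):
    """encode(o_a, o_b) -> latent action for the visual transition o_a -> o_b."""
    z_01 = encode(o_t, o_t1)    # one-step latent: o_t -> o_{t+1}
    z_12 = encode(o_t1, o_t2)   # one-step latent: o_{t+1} -> o_{t+2}
    z_02 = encode(o_t, o_t2)    # two-step latent: o_t -> o_{t+2}
    # additive composition: the two-step latent should equal the sum of the one-step latents
    return torch.nn.functional.mse_loss(z_02, z_01 + z_12)
```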
-
CUBic: Coordinated Unified Bimanual Perception and Control Framework
CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordination accuracy and task success on the RoboTwin benchmark.
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
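A rough sketch of the imagination loop under assumed interfaces: `world_model.rollout` and `vlm_reward` are hypothetical stand-ins, and the dual-noise check here is a simplified reading of the paper's verification step.

```python
import torch

def imagined_reward(world_model, vlm_reward, policy, instruction, obs,
                    horizon: int = 16, agree_tol: float = 0.1):
    """Score one imagined rollout; return None if the dual-noise check fails."""
    rewards = []
    for seed in (0, 1):                                      # two independent noise draws
        torch.manual_seed(seed)
        frames = world_model.rollout(policy, obs, horizon)   # assumed world-model interface
        rewards.append(vlm_reward(frames, instruction))      # frozen VLM scores the rollout
    if abs(rewards[0] - rewards[1]) > agree_tol:
        return None             # rollouts disagree: treat as hallucination and discard
    return 0.5 * (rewards[0] + rewards[1])                   # reward used to update the VLA
```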
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
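A minimal sketch of mapping a human correction back to noise space, assuming the VLA's action head is a flow-matching policy with a learned velocity field; `velocity_fn`, the step count, and the Euler integrator are illustrative choices, not UniSteer's exact procedure.

```python
import torch

@torch.no_grad()
def invert_action_to_noise(velocity_fn, action, obs, steps: int = 16):
    """velocity_fn(x, t, obs) -> dx/dt of the flow policy; action: (batch, action_dim).

    Integrate the learned flow backward from t = 1 to t = 0 to recover the noise
    sample that would have produced the human-corrected action.
    """
    x = action.clone()
    dt = 1.0 / steps
    for i in reversed(range(steps)):
        t = torch.full(action.shape[:1], (i + 1) * dt)
        x = x - dt * velocity_fn(x, t, obs)   # reverse Euler step along the learned flow
    return x  # noise-space target aligned with the human correction
```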
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
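One illustrative way to write such algebraic constraints on video-derived latent actions $z(o_a, o_b)$ (our formalization; ALAM's exact regularizers may differ), each enforced as a squared-error penalty during latent-action pretraining:

$$
z(o_t, o_t) \approx 0, \qquad
z(o_t, o_{t+2}) \approx z(o_t, o_{t+1}) + z(o_{t+1}, o_{t+2}), \qquad
z(o_{t+1}, o_t) \approx -\,z(o_t, o_{t+1}).
$$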
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
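A toy sketch of the anchor-centric budgeting idea, with hypothetical names: `select_demo_configs`, the k-means anchors, and the 80/20 split are illustrative assumptions, not the paper's recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_demo_configs(candidate_configs, budget: int, n_anchors: int = 4,
                        repeat_frac: float = 0.8):
    """Pick `budget` demo configurations: mostly repeats at anchor states plus a small diverse tail."""
    anchors = KMeans(n_clusters=n_anchors, n_init=10).fit(candidate_configs).cluster_centers_
    n_repeat = int(repeat_frac * budget)
    repeated = anchors[np.random.randint(n_anchors, size=n_repeat)]   # repeated anchor demos
    diverse = candidate_configs[np.random.choice(len(candidate_configs),
                                                 budget - n_repeat, replace=False)]
    return np.concatenate([repeated, diverse], axis=0)

# toy usage: 50 demo configurations drawn from 500 candidate object poses
configs = np.random.rand(500, 3)
demo_plan = select_demo_configs(configs, budget=50)
```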
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits: image-based supervision aids reasoning and generalization, action-based supervision aids motor control, and discrete tokens prove most effective overall.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
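A minimal sketch of one plausible reading of the spectral split, with hypothetical names; the SVD-based partition and the rank below are assumptions, not VLA-GSE's released code. The dominant singular directions form a frozen generalized expert, while a small trainable low-rank tail acts as the specialized expert.

```python
import torch
import torch.nn as nn

class SpectralExpertLinear(nn.Module):
    """Split a pretrained linear layer into a frozen generalized expert plus a trainable low-rank specialized expert."""
    def __init__(self, pretrained_weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        # generalized expert: frozen reconstruction from the dominant spectrum
        self.register_buffer("W_general", (U[:, :-rank] * S[:-rank]) @ Vh[:-rank])
        # specialized expert: trainable low-rank factors initialized from the tail spectrum
        self.U_s = nn.Parameter((U[:, -rank:] * S[-rank:]).clone())
        self.V_s = nn.Parameter(Vh[-rank:].clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.W_general + self.U_s @ self.V_s).T

# toy usage: only the low-rank specialized factors receive gradients during adaptation
layer = SpectralExpertLinear(torch.randn(512, 512), rank=8)
out = layer(torch.randn(4, 512))
```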
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.