ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
hub Canonical reference
Pointvla: Injecting the 3d world into vision-language- action models.arXiv preprint arXiv:2503.07511, 2025a
Canonical reference. 83% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.
GeoHAT reports a 79.3% mean success rate on the ManiSkill-HAB mobile manipulation benchmark (23.7% above the strongest baseline) by using gated geometric token injection and a hybrid whole-body action decoder.
World Pilot augments VLA policies with world-action priors through latent and action steering pathways, reporting 84.7% success on LIBERO-Plus zero-shot OOD and top real-robot results across four tasks.
OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
BehaviorVLA learns long-horizon behavioral representations via causal Mamba encoder and phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% data in sim-to-real transfer.
The paper quantifies the geometric gap in current VLAs via linear probing and compares three architectures for injecting geometry from GFMs while analyzing impacts of data, cameras, and reconstruction quality.