Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.
hub Canonical reference
A Pragmatic VLA Foundation Model
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5~2.8$\times$ (depending on the relied VLM base model) speedup over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
hub tools
citation-role summary
citation-polarity summary
years
2026 46representative citing papers
MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
RLDT fine-tunes pretrained flow-matching policies for continuous control by aligning them to a max-entropy RL transport field constructed via SVGD, using expected-target estimation for stable multi-step updates.
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.
VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.
ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.
An automatic single-demo VLM trajectory labelling pipeline enables keypose-guided diffusion policies that match baseline performance and show preliminary benefits for cross-embodiment transfer on robomimic tasks.
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
ACE-Ego-0 is a VLA pretraining framework that turns egocentric human videos into robot-format pseudo-actions via a video-to-action pipeline and trains jointly with robot data under a reliability-aware objective.
MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
FineVLA unifies robot datasets into 47k fine-grained trajectories, adds a VLM annotator and benchmark, and shows that mixing fine-grained and goal-level instructions improves steerable control without hurting task success.
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
citing papers explorer
-
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.