JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

2026 · cs.RO · arXiv 2604.20100

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.

representative citing papers

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

cs.RO · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.

PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking

cs.RO · 2026-05-30 · unverdicted · novelty 6.0

PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

cs.RO · 2026-05-29 · unverdicted · novelty 5.0

DeMaVLA is a VLA foundation model using a pruned action expert and flow matching, pre-trained on 5000 hours of real demonstrations and post-trained on multi-task folding data with human-in-the-loop correction, reporting competitive benchmark and real-world folding performance.

JoyAI-Sim: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid

cs.RO · 2026-06-15 · unverdicted · novelty 3.0

JoyAI-Sim provides bidirectional Robot-Simulation-Human pathways for aligned model evaluation and data generation in robotics using the JoySim simulator as an evaluation layer and physical consistency filter.

citing papers explorer

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

cs.RO · 2026-06-29 · unverdicted · none · ref 67 · 2 links · internal anchor

PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking

cs.RO · 2026-05-30 · unverdicted · none · ref 28 · internal anchor

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

cs.RO · 2026-05-29 · unverdicted · none · ref 36 · internal anchor

JoyAI-Sim: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid

cs.RO · 2026-06-15 · unverdicted · none · ref 51 · internal anchor
