Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.
hub Canonical reference
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
hub tools
citation-role summary
citation-polarity summary
years
2026 21verdicts
UNVERDICTED 21polarities
background 12representative citing papers
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.
MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.
LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-distribution robustness.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstrained environments.
citing papers explorer
-
Dexora: Open-source VLA for High-DoF Bimanual Dexterity
Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.
-
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
-
EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices
EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
-
SCAR: Self-Supervised Continuous Action Representation Learning
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
-
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
-
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
-
Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
-
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
-
Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum
A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.
-
LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment
LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-distribution robustness.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstrained environments.