EgoTouch is a new multi-view egocentric dataset with dense bimanual tactile supervision, and TouchAnything is a baseline framework showing that wrist views improve vision-based tactile prediction over egocentric input alone.
EgoDex: Learning dexterous manipulation from large-scale egocentric video
18 papers indexed on Pith cite this work; polarity classification is still indexing.
Citation roles: dataset (1), background (1).

Representative citing papers:
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from single human demonstrations without paired data.
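For intuition, here is a minimal PyTorch sketch of a two-branch contrastive objective over task and embodiment latents; the encoder split, the pairing scheme, and all names are illustrative assumptions, not the paper's actual design (in particular, the paper avoids paired human-robot data, which this toy does not reproduce).

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    # Standard InfoNCE: anchor[i] should be most similar to positive[i].
    anchor, positive = F.normalize(anchor, dim=-1), F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature
    return F.cross_entropy(logits, torch.arange(anchor.size(0)))

# Assume a video encoder emits a task latent z_task and an embodiment latent
# z_emb for each clip; a/b index two clips that share the relevant factor.
B, D = 8, 128
z_task_a, z_task_b = torch.randn(B, D, requires_grad=True), torch.randn(B, D, requires_grad=True)
z_emb_a, z_emb_b = torch.randn(B, D, requires_grad=True), torch.randn(B, D, requires_grad=True)

# Task branch: clips showing the same task (different embodiments) must align.
loss_task = info_nce(z_task_a, z_task_b)
# Embodiment branch: clips from the same embodiment (different tasks) must align,
# so embodiment information stays out of the task latent and vice versa.
loss_emb = info_nce(z_emb_a, z_emb_b)
(loss_task + loss_emb).backward()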
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
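The following is a minimal sketch of fusing per-finger camera views into a conditioning vector for a diffusion-style action head; all module names, shapes, and the noise schedule are assumptions for illustration, not FingerViP's actual policy.

import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """One shared encoder applied to every camera view, attention-pooled into a single token."""
    def __init__(self, num_views=6, dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 5, 2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, dim))
        self.view_embed = nn.Parameter(torch.zeros(num_views, dim))  # tags which camera a token came from
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.query = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, views):                       # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        f = self.enc(views.flatten(0, 1)).view(B, V, -1) + self.view_embed
        fused, _ = self.attn(self.query.expand(B, -1, -1), f, f)
        return fused.squeeze(1)                     # (B, dim)

# Denoising head: predict the noise added to an action chunk given the fused views.
fusion = MultiViewFusion()
denoiser = nn.Sequential(nn.Linear(16 * 22 + 128 + 1, 256), nn.ReLU(), nn.Linear(256, 16 * 22))
views, actions = torch.randn(4, 6, 3, 96, 96), torch.randn(4, 16, 22)
t = torch.rand(4, 1)                                # diffusion timestep in [0, 1)
noise = torch.randn_like(actions)
noisy = (1 - t).sqrt().view(-1, 1, 1) * actions + t.sqrt().view(-1, 1, 1) * noise
pred = denoiser(torch.cat([noisy.flatten(1), fusion(views), t], dim=-1))
loss = ((pred.view_as(noise) - noise) ** 2).mean()
loss.backward()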
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
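A minimal sketch of a latent visual foresight bottleneck is given below; the module names, losses, and the two-stage split into foresight and action objectives are assumptions for illustration, not DIAL's actual implementation.

import torch
import torch.nn as nn

obs_dim, latent_dim, act_dim = 512, 64, 7

intent_encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
foresight_head = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))
action_decoder = nn.Sequential(nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

obs_now = torch.randn(32, obs_dim)      # current visual features
obs_future = torch.randn(32, obs_dim)   # features of a future frame taken from video
actions = torch.randn(32, act_dim)

# Stage 1: the intent latent must predict the future observation, so "what
# should happen next" is squeezed through a small bottleneck (visual foresight).
z = intent_encoder(obs_now)
loss_foresight = ((foresight_head(z) - obs_future) ** 2).mean()

# Stage 2: the action decoder reads the current observation plus the intent
# latent (detached here), decoupling intent from low-level action generation.
loss_action = ((action_decoder(torch.cat([obs_now, z.detach()], -1)) - actions) ** 2).mean()
(loss_foresight + loss_action).backward()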
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising real-world success from 42.5% to 70.8%.
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.
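The toy model below illustrates the three ingredients named above (robot-centric masking, directional spatial aggregation, temporal refinement) in a single forward pass; all names, shapes, and pooling choices are assumptions for illustration, not StableIDM's actual design.

import torch
import torch.nn as nn

class TinyIDM(nn.Module):
    """Toy inverse dynamics model: masked features -> directional pooling -> GRU -> action."""
    def __init__(self, feat_dim=64, act_dim=7):
        super().__init__()
        self.row_conv = nn.Conv1d(feat_dim, feat_dim, 3, padding=1)
        self.col_conv = nn.Conv1d(feat_dim, feat_dim, 3, padding=1)
        self.temporal = nn.GRU(feat_dim * 2, 128, batch_first=True)
        self.head = nn.Linear(128, act_dim)

    def forward(self, feats, robot_mask):
        # feats: (B, T, C, H, W) visual features; robot_mask: (B, T, 1, H, W), 1 on the robot.
        x = (feats * robot_mask).flatten(0, 1)             # robot-centric masking
        rows = self.row_conv(x.mean(dim=-1)).amax(dim=-1)  # average over width, conv along rows
        cols = self.col_conv(x.mean(dim=-2)).amax(dim=-1)  # average over height, conv along columns
        tok = torch.cat([rows, cols], dim=-1).view(feats.shape[0], feats.shape[1], -1)
        out, _ = self.temporal(tok)                        # temporal dynamics refinement
        return self.head(out[:, -1])                       # action between the last two frames

model = TinyIDM()
feats = torch.randn(2, 4, 64, 24, 24)
mask = (torch.rand(2, 4, 1, 24, 24) > 0.5).float()
print(model(feats, mask).shape)                            # torch.Size([2, 7])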
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
A position paper proposes decomposing affective robotic touch into multiple specialized deep learning models for better social human-robot interaction.
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.