FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
hub Canonical reference
In- n-on: Scaling egocentric manipulation with in-the-wild and on-task data
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 16representative citing papers
SFHand presents the first streaming language-guided autoregressive framework for 3D hand forecasting, achieving up to 35.8% gains over prior methods and 13.4% better downstream embodied task performance.
Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
GLAM learns a shared latent action space grounded in consistent future observation prediction across heterogeneous data sources to train improved behavioral cloning policies for robot manipulation tasks.
ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
ConTrack introduces a constrained RL method with online dual-variable adaptation and adaptive resets for improved long-horizon hand tracking in simulation and on real robots.
The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.
citing papers explorer
-
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
-
SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting
SFHand presents the first streaming language-guided autoregressive framework for 3D hand forecasting, achieving up to 35.8% gains over prior methods and 13.4% better downstream embodied task performance.
-
Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments
Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.
-
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
-
Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models
GLAM learns a shared latent action space grounded in consistent future observation prediction across heterogeneous data sources to train improved behavioral cloning policies for robot manipulation tasks.
-
ActiveMimic: Egocentric Video Pretraining with Active Perception
ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.
-
LACE: Latent Visual Representation for Cross-Embodiment Learning
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
-
SCAR: Self-Supervised Continuous Action Representation Learning
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
-
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
-
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
-
ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control
ConTrack introduces a constrained RL method with online dual-variable adaptation and adaptive resets for improved long-horizon hand tracking in simulation and on real robots.
-
From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data
The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.