StableHand introduces a quality-aware flow-matching framework that estimates dual-hand world-space motion from egocentric video, conditioned on a predicted four-channel, per-frame measure of hand observation quality. It achieves state-of-the-art results, reducing W-MPJPE by 20-25% on the HOT3D and ARCTIC benchmarks.
11 Pith papers cite this work.
representative citing papers
EventTrack6D tracks 6D poses of unseen objects from event cameras by reconstructing dense intensity and depth cues between frames, generalizing from synthetic training to real data at high speed.
EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
An open-vocabulary pipeline anchors functional edges via 2D visual grounding, then uses temporal 3D graph optimization with evidence accumulation and entropy regularization to build hierarchical scene graphs for dense indoor scenes.
EgoForce recovers absolute camera-space 3D hand pose from monocular egocentric images using forearm guidance, a unified arm-hand transformer, and a closed-form ray-space solver that handles fisheye, perspective, and wide-FOV cameras.
DexSynRefine couples HOI motion manifold flow primitives with task-space residual RL and proprioceptive adaptation to convert human-object interaction data into executable dexterous robot motions, reporting real-world success-rate gains of 50-70 points over kinematic retargeting on five tasks.
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on the TASTE-Rob dataset.
EgoVLA pretrains VLA models on egocentric human videos, retargets predicted actions to robots via IK, and fine-tunes on a few robot demonstrations to improve bimanual manipulation performance on a new simulation benchmark.
HumanEgo reports 92.5% average success on four real robot tasks using only 15-30 minutes of human video per task and zero robot data, with zero-shot transfer to new robots and cameras.
VGGT-Ω improves feed-forward reconstruction accuracy and efficiency through architectural simplifications, register-based attention, and training on much larger supervised and unlabeled video data.
citing papers explorer
- EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
- HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis