Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific reward shaping integrated with multi-critic and multi-discriminator learning, where dedicated networks capture the distinct dynamics and motion priors of each terrain type. We validate our approach on two humanoid platforms equipped with different stereo depth cameras. The resulting policy demonstrates robust performance across diverse environments, seamlessly handling extreme challenges such as high platforms and wide gaps, as well as fine-grained tasks including bidirectional long-term staircase traversal.
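The abstract mentions a depth sensor simulation that captures stereo matching artifacts. As a rough illustration of why such artifacts matter, the sketch below corrupts a grid of true depths by passing them through disparity space, where stereo noise and quantization actually occur. It is not the authors' simulator. The function name `simulate_stereo_depth` and every parameter (focal length, baseline, noise level, subpixel step, dropout rate) are hypothetical values chosen for illustration, and it models only matching noise, quantization, and holes, not calibration uncertainty.

```python
import random


def simulate_stereo_depth(depth_m, focal_px=400.0, baseline_m=0.05,
                          disparity_noise_px=0.25, subpixel_step=0.125,
                          min_disparity_px=1.0, dropout_prob=0.02, rng=None):
    """Corrupt a grid (list of rows) of true depths in metres.

    All defaults are illustrative assumptions, not values from the paper.
    Invalid pixels (holes) are returned as 0.0.
    """
    rng = rng or random.Random()
    noisy = []
    for row in depth_m:
        out_row = []
        for z in row:
            # Random holes stand in for failed stereo matches.
            if z <= 0.0 or rng.random() < dropout_prob:
                out_row.append(0.0)
                continue
            # Ideal disparity from the pinhole stereo relation d = f * b / z.
            d = focal_px * baseline_m / z
            # Matching error is roughly constant in disparity (pixels),
            # so the resulting depth error grows with distance.
            d += rng.gauss(0.0, disparity_noise_px)
            # Quantize to the matcher's subpixel resolution.
            d = round(d / subpixel_step) * subpixel_step
            # Very small disparities (far points) are treated as invalid.
            if d < min_disparity_px:
                out_row.append(0.0)
                continue
            out_row.append(focal_px * baseline_m / d)
        noisy.append(out_row)
    return noisy
```

Because the noise is applied in disparity rather than directly in depth, nearby surfaces such as stair edges stay comparatively sharp while distant ones become coarse and noisy, which is the kind of structure a policy trained on clean height maps would not otherwise see.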
fields: cs.RO (6)
years: 2026 (6)
verdicts: UNVERDICTED (6)
citing papers explorer
- Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain
  Perceptive BFM grounds human motion priors in robot terrain perception via terrain-conformal reference synthesis and teacher-student transfer from adapted to raw-reference tracking.
- TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion
  TAGA learns terrain-aware active gaze behaviors for humanoid robots via RL alone, enabling generalizable locomotion with 1.2 m real-world gap traversal.
- GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains
  GuideWalk unifies traversability-aware navigation and terrain-adaptive locomotion into a single policy for humanoid robots via teacher distillation and RL refinement.
- VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands
  VAIC distills a teacher policy into a vision-and-proprioception student policy using recurrent adaptation and decoupled commands, enabling diverse real-robot tasks such as box carrying and skateboarding and outperforming baselines.
- SSR: Scaling Surefooted and Symmetric Humanoid Traversal to the Open World
  SSR is an end-to-end vision-based framework for humanoid traversal that learns imagined foothold guidance, equivariant latent-space symmetry augmentation, and terrain-specific multi-discriminator motion priors to enable safe locomotion on diverse real-world terrains.
- TACT-ful: Multi-Channel Terrain Affordance and Compliance Training for Payload-Robust Perceptive Humanoid Locomotion
  A multi-channel terrain affordance reward combined with lower-body compliance training via virtual wrenches enables end-to-end PPO-trained humanoid policies to walk at 1 m/s on 0.2 m risers with improved payload robustness.