pith. sign in

arxiv: 2606.05880 · v1 · pith:GQE72WJKnew · submitted 2026-06-04 · 💻 cs.RO

TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

Pith reviewed 2026-06-28 01:16 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid locomotionactive gazereinforcement learningterrain perceptionattention mechanismperceptive controlfoothold selectionagile locomotion
0
0 comments X

The pith

A terrain-aware active gaze framework lets reinforcement learning produce selective attention that enables 1.2 meter real-world gap crossings for humanoid robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that humanoid locomotion policies can learn to actively focus on the most informative parts of terrain height scans by fusing vision with proprioception and commands inside an attention mechanism. This selective focus raises the information density of each observation under onboard compute limits and supports anticipatory foothold choices over large or difficult surfaces. A sympathetic reader would care because it removes the need for hand-designed gaze rules or extra labels while still delivering hardware results that include reliable traversal of elevated platforms, sparse footholds, and gaps up to 1.2 meters. The claim is that these gaze patterns arise naturally from reinforcement learning alone and directly improve both training speed and final generalization.

Core claim

By fusing vision, proprioception, and motion commands, the attention-based controller learns to attend to specific regions of the height scan and uses those regions for downstream action selection. Gaze behaviors emerge through reinforcement learning without additional supervision or explicit guidance, raising observation efficiency and allowing fine-grained perceptive locomotion across larger terrains. The resulting policy produces robust simulation-to-hardware transfer that includes terrain-aware foothold selection, elevated-platform crossing, competitive sparse-foothold performance, and the largest reported real-world gap traversal of 1.2 meters while remaining stable under perceptual dis

What carries the argument

The Terrain-aware Active Gaze (TAGA) attention module that adaptively selects informative height-scan patches before they reach the policy network.

If this is right

  • The policy performs reliable terrain-aware foothold selection across varied surfaces.
  • It traverses elevated platforms without loss of balance.
  • It matches or exceeds prior methods on sparse-foothold sequences.
  • It achieves 1.2 meter gap crossings in hardware, the largest reported for perceptive humanoids.
  • It keeps balance under severe perceptual noise and environmental interference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention emergence might reduce the need for separate perception pipelines in other sensor-rich control tasks.
  • Policies trained this way could generalize to new robot morphologies with only retraining of the attention weights.
  • If the pattern holds, future work could test whether similar unsupervised attention appears in non-locomotion settings such as arm manipulation over cluttered tables.

Load-bearing premise

Gaze behaviors will emerge from ordinary reinforcement learning without any extra supervision and that this emergence is what produces the reported hardware performance gains.

What would settle it

An ablation that removes the learned attention module or forces uniform sampling of the height scan and then measures whether the policy still reaches 1.2 meter real-world gap crossings with comparable stability.

Figures

Figures reproduced from arXiv: 2606.05880 by Fangzhou Xu, Guillaume Sartoretti, Hongtao Wang, Hongyi Li, Mingfeng Fan, Peizhuo Li, Shuhao Liao, Yongbin Jin, Yuhong Cao, Yuxuan Ma, Ze Wang, Zicheng Zeng.

Figure 1
Figure 1. Figure 1: TAGA enables agile and robust humanoid locomotion across diverse challenging terrains. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between local height scan and depth image perception. Existing perceptive locomotion methods can generally be divided into two categories: mapping-based methods and vision-based methods. Mapping-based approaches use point clouds or reconstructed height scans as compact ter￾rain representations for locomotion [8, 9, 10, 11]. While effective, these methods often incur increasing computa￾tional cos… view at source ↗
Figure 3
Figure 3. Figure 3: The architecture of TAGA. 3.2 Neural Network Design Overview. As shown in [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of the learned active gaze regions and attention-weight distributions of [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Robot stance without AMP. Without motion guidance, the robot constantly walk with its leg bent. Motion Prior (Q3). Removing the AMP priors does not substantially hurt task completion, with TAGA-NoAMP performing close to TAGA across all terrains ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Real-world evaluation of TAGA from controlled indoor terrains to unstructured outdoor [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training Terrains for TAGA To improve traversability across diverse terrain conditions, we construct a broad set of representa￾tive terrain types during training, including ascending and descending stairs, gaps, stepping stones (sparse footholds), box obstacles, elevated platforms, and sloped surfaces, as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
read the original abstract

Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, significantly improve training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TAGA, a Terrain-aware Active Gaze learning framework for attention-based humanoid locomotion. By fusing vision, proprioception, and motion commands, the framework is claimed to enable the model to learn anticipatory cues and actively attend to informative regions of the height scan. The authors state that such gaze behaviors emerge naturally through reinforcement learning without additional supervision or explicit guidance, improving training efficiency and yielding robust, generalizable policies. These policies are reported to achieve reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold performance, and the largest reported real-world gap traversal of 1.2 m among perceptive humanoid systems, while remaining stable under perceptual disturbances.

Significance. If the central claims hold after addressing the noted gaps, the work would be significant for perceptive humanoid locomotion by providing evidence that active gaze can emerge unsupervised in RL pipelines, potentially improving observation efficiency under onboard compute limits and enabling larger-scale terrain traversal. The reported hardware gap distance would represent a concrete benchmark advance if supported by detailed, reproducible methods and controls.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results): The headline claims that gaze behaviors 'naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance' and directly cause the reported training-efficiency gains and hardware robustness (including 1.2 m gap traversal) are load-bearing for the central contribution, yet no ablation studies are described that disable or randomize the attention-guidance component while holding the rest of the RL pipeline fixed. Without such controls, it is impossible to isolate whether performance stems from emergent gaze or from other elements such as proprioceptive fusion, reward shaping, or curriculum design.
  2. [Abstract] Abstract: All quantitative performance claims (terrain-aware footholds, 1.2 m gap traversal, stability under perceptual noise) are stated without accompanying metrics, error bars, training curves, or ablation tables. This absence prevents assessment of effect sizes and statistical reliability of the sim-to-real transfer attributed to the gaze mechanism.
minor comments (1)
  1. [Abstract] The abstract refers to 'the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems' without citing the specific prior works used for comparison; a table or explicit references in the main text would clarify the baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the evidence for our claims.

read point-by-point responses
  1. Referee: The headline claims that gaze behaviors 'naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance' and directly cause the reported training-efficiency gains and hardware robustness (including 1.2 m gap traversal) are load-bearing for the central contribution, yet no ablation studies are described that disable or randomize the attention-guidance component while holding the rest of the RL pipeline fixed. Without such controls, it is impossible to isolate whether performance stems from emergent gaze or from other elements such as proprioceptive fusion, reward shaping, or curriculum design.

    Authors: We agree that dedicated ablation studies isolating the attention-guidance component are needed to rigorously support the emergence claim. In the revised manuscript, we will add experiments that disable or randomize the active gaze selection while holding the RL pipeline, rewards, proprioception, and curriculum fixed, to quantify its specific contribution to training efficiency and robustness. revision: yes

  2. Referee: All quantitative performance claims (terrain-aware footholds, 1.2 m gap traversal, stability under perceptual noise) are stated without accompanying metrics, error bars, training curves, or ablation tables. This absence prevents assessment of effect sizes and statistical reliability of the sim-to-real transfer attributed to the gaze mechanism.

    Authors: We will revise the abstract to report key quantitative metrics with error bars and statistical details. We will also expand §4 to include comprehensive training curves, ablation tables, and effect-size metrics supporting the performance claims and sim-to-real transfer. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical RL training outcomes

full rationale

The paper presents its core results as outcomes of reinforcement learning training on a terrain-aware active gaze framework. No equations, fitted parameters, or self-citations are shown in the provided text that reduce the reported performance gains (e.g., 1.2m gap traversal) to definitional inputs or prior self-referential results by construction. The statement that gaze behaviors 'naturally emerge through reinforcement learning alone' is framed as an empirical observation rather than a self-definitional or fitted prediction. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit modeling choices or new postulated quantities are described.

pith-pipeline@v0.9.1-grok · 5765 in / 994 out tokens · 11622 ms · 2026-06-28T01:16:16.172812+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 1 canonical work pages

  1. [1]

    Sombolestan and Q

    M. Sombolestan and Q. Nguyen. Adaptive-force-based control of dynamic legged locomotion over uneven terrain.IEEE Transactions on Robotics, 40:2462–2477, 2024

  2. [2]

    Q. Ben, F. Jia, J. Zeng, J. Dong, D. Lin, and J. Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

  3. [3]

    Zhang, Y

    Y . Zhang, Y . Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive hu- manoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

  4. [4]

    N. Fey, G. B. Margolis, M. Peticco, and P. Agrawal. Bridging the sim-to-real gap for athletic loco-manipulation.arXiv preprint arXiv:2502.10894, 2025

  5. [5]

    Murooka, K

    M. Murooka, K. Chappellet, A. Tanguy, M. Benallegue, I. Kumagai, M. Morisawa, F. Kane- hiro, and A. Kheddar. Humanoid loco-manipulations pattern generation and stabilization con- trol.IEEE Robotics and Automation Letters, 6(3):5597–5604, 2021

  6. [6]

    Bouyarmane, K

    K. Bouyarmane, K. Chappellet, J. Vaillant, and A. Kheddar. Quadratic programming for mul- tirobot and task-space force control.IEEE Transactions on Robotics, 35(1):64–77, 2018

  7. [7]

    Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

  8. [8]

    J. He, C. Zhang, F. Jenelten, R. Grandia, M. B ¨acher, and M. Hutter. Attention-based map encoding for learning generalized legged locomotion.Science Robotics, 10(105):eadv3604, 2025

  9. [9]

    T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning robust per- ceptive locomotion for quadrupedal robots in the wild.Science Robotics, 7(62):eabk2822, 2022

  10. [10]

    J. Long, J. Ren, M. Shi, Z. Wang, T. Huang, P. Luo, and J. Pang. Learning humanoid locomo- tion with perceptive internal model. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9997–10003. IEEE, 2025

  11. [11]

    Hoeller, N

    D. Hoeller, N. Rudin, D. Sako, and M. Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

  12. [12]

    Cheng, K

    X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450. IEEE, 2024

  13. [13]

    Agarwal, A

    A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. InConference on robot learning, pages 403–415. PMLR, 2023

  14. [14]

    H. Song, H. Zhu, T. Yu, Y . Liu, M. Yuan, W. Zhou, H. Chen, and H. Li. Gait-adaptive per- ceptive humanoid locomotion with real-time under-base terrain reconstruction.IEEE Robotics and Automation Letters, 2026

  15. [15]

    R. Yang, M. Zhang, N. Hansen, H. Xu, and X. Wang. Learning vision-guided quadrupedal lo- comotion end-to-end with cross-modal transformers. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=kFdPX1VdgXx

  16. [16]

    Zhuang, S

    Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. In8th Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=fs7ia3FqUM

  17. [17]

    Zhang, V

    C. Zhang, V . Klemm, F. Yang, and M. Hutter. Ame-2: Agile and generalized legged locomotion via attention-based neural map encoding.arXiv preprint arXiv:2601.08485, 2026. 9

  18. [18]

    Fankhauser, M

    P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, and M. Hutter. Robust rough-terrain locomotion with a quadrupedal robot. In2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5761–5768. IEEE, 2018

  19. [19]

    Jenelten, T

    F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, and M. Hutter. Perceptive locomotion in rough terrain–online foothold optimization.IEEE Robotics and Automation Letters, 5(4):5370–5376, 2020

  20. [20]

    Z. Wang, Y . Li, L. Xu, H. Shi, Z. Ma, Z. Chu, C. Li, F. Gao, K. Yang, and K. Wang. Sf-tim: A simple framework for enhancing quadrupedal robot jumping agility by combining terrain imagination and measurement. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10676–10683. IEEE, 2025

  21. [21]

    T. Miki, L. Wellhausen, R. Grandia, F. Jenelten, T. Homberger, and M. Hutter. Elevation map- ping for locomotion and navigation using gpu. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2273–2280. IEEE, 2022

  22. [22]

    Y . Dong, J. Ma, L. Zhao, W. Li, and P. Lu. Marg: Mastering risky gap terrains for legged robots with elevation mapping.IEEE Transactions on Robotics, 2025

  23. [23]

    Zhang, N

    C. Zhang, N. Rudin, D. Hoeller, and M. Hutter. Learning agile locomotion on risky terrains. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11864–11871. IEEE, 2024

  24. [24]

    Y . Chen, J. Ma, Z. Luo, Y . Han, Y . Dong, B. Xu, and P. Lu. Learning autonomous and safe quadruped traversal of complex terrains using multi-layer elevation maps.IEEE Robotics and Automation Letters, 2025

  25. [25]

    T. Miki, J. Lee, L. Wellhausen, and M. Hutter. Learning to walk in confined spaces using 3d representation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8649–8656. IEEE, 2024

  26. [26]

    Fankhauser, M

    P. Fankhauser, M. Bloesch, and M. Hutter. Probabilistic terrain mapping for mobile robots with uncertain localization.IEEE Robotics and Automation Letters, 3(4):3019–3026, 2018

  27. [27]

    W. Yu, D. Jain, A. Escontrela, A. Iscen, P. Xu, E. Coumans, S. Ha, J. Tan, and T. Zhang. Visual- locomotion: Learning to walk on complex terrains with vision. In5th Annual Conference on Robot Learning, 2021

  28. [28]

    H. Duan, B. Pandit, M. S. Gadde, B. Van Marum, J. Dao, C. Kim, and A. Fern. Learning vision- based bipedal locomotion for challenging terrain. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 56–62. IEEE, 2024

  29. [29]

    H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. Beamdojo: Learning agile humanoid locomotion on sparse footholds.arXiv preprint arXiv:2502.10363, 2025

  30. [30]

    Rudin, J

    N. Rudin, J. He, J. Aurand, and M. Hutter. Parkour in the wild: Learning a general and extensible agile locomotion policy using multi-expert distillation and rl fine-tuning.arXiv preprint arXiv:2505.11164, 2025

  31. [31]

    J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y . Guo, and Q. Zhang. Dpl: Depth- only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction.arXiv preprint arXiv:2510.07152, 2025

  32. [32]

    Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: V oxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains.arXiv preprint arXiv:2511.14625, 2025. 10

  33. [33]

    S. Li, S. Luo, J. Wu, and Q. Zhu. Move: Multi-skill omnidirectional legged locomotion with limited view in 3d environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7647–7653. IEEE, 2025

  34. [34]

    P. Li, H. Li, Y . Ma, L. Chang, X. Yang, R. Yu, Y . Zhang, Y . Cao, Q. Zhu, and G. Sartoretti. Kivi: Kinesthetic-visuospatial integration for dynamic and safe egocentric legged locomotion. arXiv preprint arXiv:2509.23650, 2025

  35. [35]

    S. Luo, S. Li, R. Yu, Z. Wang, J. Wu, and Q. Zhu. Pie: Parkour with implicit-explicit learning framework for legged robots.IEEE Robotics and Automation Letters, 9(11):9986–9993, 2024

  36. [36]

    R. Yang, G. Yang, and X. Wang. Neural volumetric memory for visual locomotion control. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2023

  37. [37]

    H. Lai, J. Cao, J. Xu, H. Wu, Y . Lin, T. Kong, Y . Yu, and W. Zhang. World model-based perception for visual legged locomotion. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11531–11537. IEEE, 2025

  38. [38]

    Zhang, J

    C. Zhang, J. Jin, J. Frey, N. Rudin, M. Mattamala, C. Cadena, and M. Hutter. Resilient legged local navigation: Learning to traverse with compromised perception end-to-end. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 34–41. IEEE, 2024

  39. [39]

    Hoeller, N

    D. Hoeller, N. Rudin, C. Choy, A. Anandkumar, and M. Hutter. Neural scene representation for locomotion on structured terrain.IEEE Robotics and Automation Letters, 7(4):8667–8674, 2022

  40. [40]

    R. Yu, Q. Wang, H. Li, Z. Jun, Z. Wang, J. Wu, and Q. Zhu. Start: Traversing sparse footholds with terrain reconstruction.IEEE Robotics and Automation Letters, 11(2):2194–2201, 2025

  41. [41]

    F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter. Spatially-enhanced recur- rent memory for long-range mapless navigation via end-to-end reinforcement learning.The International Journal of Robotics Research, page 02783649251401926, 2025

  42. [42]

    A. Reed, B. Crowe, D. Albin, L. Achey, B. Hayes, and C. Heckman. Scenesense: Diffusion models for 3d occupancy synthesis from partial observation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7383–7390. IEEE, 2024

  43. [43]

    S. Shao, T. Huang, W. Gao, and S. Zhang. Adapt: Adaptive dual-projection architecture for perceptive traversal.arXiv preprint arXiv:2603.16328, 2026

  44. [44]

    Singh, Y

    K. Singh, Y . Kim, Y . Turkar, and K. Dantu. Cart: Context-aware terrain adaptation using temporal sequence selection for legged robots.arXiv preprint arXiv:2604.14344, 2026

  45. [45]

    S. Ma, H. Chen, Z. Xu, Y . Zhao, K. Wu, R. Yang, L. Zou, Z. Gan, and W. Ding. Cmoe: Contrastive mixture of experts for motion control and terrain adaptation of humanoid robots. arXiv preprint arXiv:2603.03067, 2026

  46. [46]

    Schwarke, M

    C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

  47. [47]

    Mittal, N

    M. Mittal, N. Rudin, V . Klemm, A. Allshire, and M. Hutter. Symmetry considerations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024

  48. [48]

    X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4):1–20,

  49. [49]

    doi:10.1145/3450626.3459670

    ISSN 1557-7368. doi:10.1145/3450626.3459670. URLhttp://dx.doi.org/10. 1145/3450626.3459670. 11

  50. [50]

    Mittal, P

    M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Zurbr ¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, H...

  51. [51]

    Y . Hao, R. Yu, S. Luo, G. Zhang, J. Wu, and Q. Zhu. Cref: Cross-modal and recurrent fusion for depth-conditioned humanoid locomotion.arXiv preprint arXiv:2603.29452, 2026

  52. [52]

    W. Sun, Y . Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, B. Cao, Y . Liu, et al. Now you see that: Learning end-to-end humanoid locomotion from raw pixels.arXiv preprint arXiv:2602.06382, 2026

  53. [53]

    D. Wang, X. Wang, X. Liu, J. Shi, Y . Zhao, C. Bai, and X. Li. More: Mixture of residual experts for humanoid lifelike gaits learning on complex terrains.arXiv preprint arXiv:2506.08840, 2025

  54. [54]

    S. Zhu, Z. Zhuang, M. Zhao, K.-Y . Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids.arXiv preprint arXiv:2601.07718, 2026

  55. [55]

    Zhang, Y

    Y . Zhang, Y . Seo, J. Chen, Y . Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains.arXiv preprint arXiv:2602.03002, 2026

  56. [56]

    Mahmood, N

    N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Conference on Computer Vision, pages 5442–5451, Oct. 2019. 12 A The Details of POMDP We formulate humanoid perceptive locomotion as a partially observable Markov decision process (POMDP), denoted as a 6-tupleM=⟨S,O,A,P,R...