TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

Peizhuo Li, Hongyi Li, Mingfeng Fan, Fangzhou Xu, Shuhao Liao, Yuxuan Ma, Zicheng Zeng, Ze Wang, Yongbin Jin, Yuhong Cao, Hongtao Wang, Guillaume Sartoretti

arxiv: 2606.05880 · v1 · pith:GQE72WJKnew · submitted 2026-06-04 · 💻 cs.RO

Pith reviewed 2026-06-28 01:16 UTC · model grok-4.3

classification 💻 cs.RO

keywords: humanoid locomotion, active gaze, reinforcement learning, terrain perception, attention mechanism, perceptive control, foothold selection, agile locomotion

The pith

A terrain-aware active gaze framework lets reinforcement learning produce selective attention that enables 1.2 meter real-world gap crossings for humanoid robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that humanoid locomotion policies can learn to actively focus on the most informative parts of terrain height scans by fusing vision with proprioception and commands inside an attention mechanism. This selective focus raises the information density of each observation under onboard compute limits and supports anticipatory foothold choices over large or difficult surfaces. A sympathetic reader would care because it removes the need for hand-designed gaze rules or extra labels while still delivering hardware results that include reliable traversal of elevated platforms, sparse footholds, and gaps up to 1.2 meters. The claim is that these gaze patterns arise naturally from reinforcement learning alone and directly improve both training speed and final generalization.

Core claim

By fusing vision, proprioception, and motion commands, the attention-based controller learns to attend to specific regions of the height scan and uses those regions for downstream action selection. Gaze behaviors emerge through reinforcement learning without additional supervision or explicit guidance, raising observation efficiency and allowing fine-grained perceptive locomotion across larger terrains. The resulting policy produces robust simulation-to-hardware transfer that includes terrain-aware foothold selection, elevated-platform crossing, competitive sparse-foothold performance, and the largest reported real-world gap traversal of 1.2 meters while remaining stable under perceptual dis

What carries the argument

The Terrain-aware Active Gaze (TAGA) attention module that adaptively selects informative height-scan patches before they reach the policy network.
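
To make the idea of selecting informative height-scan patches concrete, here is a minimal sketch under stated assumptions. It is not the authors' architecture: the function name `select_patches`, the patch size, the embedding width, and the random projection matrices (stand-ins for learned weights) are all hypothetical. It only illustrates one generic way a query built from proprioception and commands could score patches with scaled dot-product attention and keep the top few.

```python
import numpy as np

def select_patches(height_scan, proprio, command, patch=4, k=3, dim=16, seed=0):
    """Score height-scan patches against a proprioception+command query.

    Returns the indices of the k highest-weighted patches, the softmax
    weights over all patches, and the selected patches themselves.
    The projection matrices are random stand-ins for learned weights.
    """
    rng = np.random.default_rng(seed)
    rows, cols = height_scan.shape
    # Cut the scan into non-overlapping patch x patch tiles, one row per tile.
    tiles = (height_scan.reshape(rows // patch, patch, cols // patch, patch)
             .transpose(0, 2, 1, 3)
             .reshape(-1, patch * patch))
    context = np.concatenate([proprio, command])
    w_key = rng.standard_normal((patch * patch, dim))
    w_query = rng.standard_normal((context.size, dim))
    keys = tiles @ w_key            # one key vector per patch
    query = context @ w_query       # single query from robot state and command
    scores = keys @ query / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax over patches
    top = np.argsort(weights)[::-1][:k]
    return top, weights, tiles[top]

# Hypothetical inputs: an 8 x 12 height scan, 6 proprioceptive values, 3 command values.
rng = np.random.default_rng(1)
scan = rng.normal(size=(8, 12))
top, weights, selected = select_patches(scan, rng.normal(size=6), rng.normal(size=3))
print(top, selected.shape)
```

In a trained policy the projections would be learned end to end from the reinforcement learning objective; the sketch assumes scan dimensions that are exact multiples of the patch size.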

If this is right

The policy performs reliable terrain-aware foothold selection across varied surfaces.
It traverses elevated platforms without loss of balance.
It matches or exceeds prior methods on sparse-foothold sequences.
It achieves 1.2 meter gap crossings in hardware, the largest reported for perceptive humanoids.
It keeps balance under severe perceptual noise and environmental interference.
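
The robustness claim in the last item could be probed with a simple perturbation harness. The sketch below is an assumption-laden illustration, not the paper's disturbance model: the function name `corrupt_scan`, the noise magnitude, the dropout rate, and the fill value are hypothetical choices. It adds Gaussian noise to a height scan and blanks out a random fraction of cells.

```python
import numpy as np

def corrupt_scan(scan, noise_std=0.02, dropout=0.1, fill=0.0, seed=0):
    """Return a noisy copy of a height scan plus the mask of dropped cells.

    noise_std, dropout, and fill are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Additive Gaussian noise on every cell; the input array is left untouched.
    noisy = scan + rng.normal(0.0, noise_std, size=scan.shape)
    # Drop a random subset of cells to mimic missing measurements.
    mask = rng.random(scan.shape) < dropout
    noisy[mask] = fill
    return noisy, mask

flat = np.ones((5, 7))                      # made-up flat terrain at 1 m height
noisy, mask = corrupt_scan(flat, dropout=0.5, seed=3)
print(int(mask.sum()), "cells dropped")
```

Sweeping `noise_std` and `dropout` while recording traversal success would show how quickly stability degrades as perception gets worse.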

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attention emergence might reduce the need for separate perception pipelines in other sensor-rich control tasks.
Policies trained this way could generalize to new robot morphologies with only retraining of the attention weights.
If the pattern holds, future work could test whether similar unsupervised attention appears in non-locomotion settings such as arm manipulation over cluttered tables.

Load-bearing premise

That gaze behaviors will emerge from ordinary reinforcement learning without any extra supervision, and that this emergence is what produces the reported hardware performance gains.

What would settle it

An ablation that removes the learned attention module or forces uniform sampling of the height scan and then measures whether the policy still reaches 1.2 meter real-world gap crossings with comparable stability.
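
A sketch of how the gaze conditions in such an ablation might be switched while everything else stays fixed. The function name `choose_patches` and the three mode names are hypothetical, and the attention weights in the example are made up; nothing here comes from the paper's code.

```python
import numpy as np

def choose_patches(weights, k, mode, rng):
    """Return k patch indices under one of three gaze conditions."""
    n = weights.size
    if mode == "learned":    # keep the k highest attention weights
        return np.argsort(weights)[::-1][:k]
    if mode == "uniform":    # evenly spaced patches; ignores the weights
        return np.linspace(0, n - 1, k).round().astype(int)
    if mode == "random":     # fresh random subset on every call
        return rng.choice(n, size=k, replace=False)
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
attn = np.array([0.10, 0.40, 0.05, 0.30, 0.15])  # made-up attention weights
for mode in ("learned", "uniform", "random"):
    print(mode, choose_patches(attn, 2, mode, rng))
```

Holding `k` constant across modes keeps the observation budget identical, so any performance gap can be attributed to which patches are chosen rather than how many.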

Figures

Figures reproduced from arXiv: 2606.05880 by Fangzhou Xu, Guillaume Sartoretti, Hongtao Wang, Hongyi Li, Mingfeng Fan, Peizhuo Li, Shuhao Liao, Yongbin Jin, Yuhong Cao, Yuxuan Ma, Ze Wang, Zicheng Zeng.

**Figure 2.** Comparison between local height scan and depth image perception. Existing perceptive locomotion methods can generally be divided into two categories: mapping-based methods and vision-based methods. Mapping-based approaches use point clouds or reconstructed height scans as compact terrain representations for locomotion [8, 9, 10, 11]. While effective, these methods often incur increasing computational cos…

**Figure 3.** The architecture of TAGA.

**Figure 4.** Visualization of the learned active gaze regions and attention-weight distributions of…

**Figure 5.** Robot stance without AMP. Without motion guidance, the robot constantly walks with its legs bent. Motion Prior (Q3): removing the AMP priors does not substantially hurt task completion, with TAGA-NoAMP performing close to TAGA across all terrains.

**Figure 6.** Real-world evaluation of TAGA from controlled indoor terrains to unstructured outdoor…

**Figure 7.** Training terrains for TAGA. To improve traversability across diverse terrain conditions, we construct a broad set of representative terrain types during training, including ascending and descending stairs, gaps, stepping stones (sparse footholds), box obstacles, elevated platforms, and sloped surfaces.

Abstract

Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, and that they significantly improve training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2 m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TAGA gets hardware results on 1.2m gaps with RL-trained gaze attention, but the causal link from unsupervised emergence to those gains rests on untested assumptions.

The main takeaway is that this work trains a humanoid policy where active gaze on terrain scans comes out of standard RL, and the resulting controller handles 1.2 meter gaps on hardware along with other tricky terrain. That 1.2m number is the biggest they report for perceptive humanoids.

The new piece is framing the attention to height maps as something that emerges without hand-crafted gaze supervision, and they tie it to better training speed and robustness under perceptual noise. The sim-to-real transfer on elevated platforms and sparse footholds looks like a practical step forward for these systems.

Where it gets thin is the causal claim. The description mentions guiding the model to attend to specific areas, so the architecture already shapes what gets looked at. To pin the performance on the emergent gaze, they need runs that remove or scramble that attention module and show the drop. Without those, the efficiency gains and hardware stability could come from the fusion of vision and proprioception or the curriculum instead. The abstract also skips the actual numbers on training curves or ablations, which makes it hard to gauge how much better this is than prior attentive controllers.

This is aimed at the legged robotics and RL control crowd. Someone building perceptive policies would find the setup useful to build on, even if they have to add their own controls later.

I would send it out for review. The hardware claims are concrete enough to merit referee input, provided the full paper fills in the missing comparisons and ablations.

Referee Report

2 major / 1 minor

Summary. The paper introduces TAGA, a Terrain-aware Active Gaze learning framework for attention-based humanoid locomotion. By fusing vision, proprioception, and motion commands, the framework is claimed to enable the model to learn anticipatory cues and actively attend to informative regions of the height scan. The authors state that such gaze behaviors emerge naturally through reinforcement learning without additional supervision or explicit guidance, improving training efficiency and yielding robust, generalizable policies. These policies are reported to achieve reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold performance, and the largest reported real-world gap traversal of 1.2 m among perceptive humanoid systems, while remaining stable under perceptual disturbances.

Significance. If the central claims hold after addressing the noted gaps, the work would be significant for perceptive humanoid locomotion by providing evidence that active gaze can emerge unsupervised in RL pipelines, potentially improving observation efficiency under onboard compute limits and enabling larger-scale terrain traversal. The reported hardware gap distance would represent a concrete benchmark advance if supported by detailed, reproducible methods and controls.

major comments (2)

Abstract and §4 (Results): The headline claims that gaze behaviors 'naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance' and directly cause the reported training-efficiency gains and hardware robustness (including 1.2 m gap traversal) are load-bearing for the central contribution, yet no ablation studies are described that disable or randomize the attention-guidance component while holding the rest of the RL pipeline fixed. Without such controls, it is impossible to isolate whether performance stems from emergent gaze or from other elements such as proprioceptive fusion, reward shaping, or curriculum design.
Abstract: All quantitative performance claims (terrain-aware footholds, 1.2 m gap traversal, stability under perceptual noise) are stated without accompanying metrics, error bars, training curves, or ablation tables. This absence prevents assessment of effect sizes and statistical reliability of the sim-to-real transfer attributed to the gaze mechanism.
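
As an illustration of the kind of error bar the second comment asks for, the sketch below computes a Wilson score interval for a traversal success rate. The function name `wilson_interval` is hypothetical and the trial counts in the example are invented for demonstration; they are not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Invented example: 9 successful gap crossings out of 10 hardware trials.
low, high = wilson_interval(9, 10)
print(f"success rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

With only a handful of hardware trials the interval stays wide, which is why reporting the trial count alongside the success rate matters for judging sim-to-real claims.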

minor comments (1)

[Abstract] The abstract refers to 'the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems' without citing the specific prior works used for comparison; a table or explicit references in the main text would clarify the baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the evidence for our claims.

Referee: The headline claims that gaze behaviors 'naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance' and directly cause the reported training-efficiency gains and hardware robustness (including 1.2 m gap traversal) are load-bearing for the central contribution, yet no ablation studies are described that disable or randomize the attention-guidance component while holding the rest of the RL pipeline fixed. Without such controls, it is impossible to isolate whether performance stems from emergent gaze or from other elements such as proprioceptive fusion, reward shaping, or curriculum design.

Authors: We agree that dedicated ablation studies isolating the attention-guidance component are needed to rigorously support the emergence claim. In the revised manuscript, we will add experiments that disable or randomize the active gaze selection while holding the RL pipeline, rewards, proprioception, and curriculum fixed, to quantify its specific contribution to training efficiency and robustness. revision: yes
Referee: All quantitative performance claims (terrain-aware footholds, 1.2 m gap traversal, stability under perceptual noise) are stated without accompanying metrics, error bars, training curves, or ablation tables. This absence prevents assessment of effect sizes and statistical reliability of the sim-to-real transfer attributed to the gaze mechanism.

Authors: We will revise the abstract to report key quantitative metrics with error bars and statistical details. We will also expand §4 to include comprehensive training curves, ablation tables, and effect-size metrics supporting the performance claims and sim-to-real transfer. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical RL training outcomes

The paper presents its core results as outcomes of reinforcement learning training on a terrain-aware active gaze framework. No equations, fitted parameters, or self-citations are shown in the provided text that reduce the reported performance gains (e.g., 1.2m gap traversal) to definitional inputs or prior self-referential results by construction. The statement that gaze behaviors 'naturally emerge through reinforcement learning alone' is framed as an empirical observation rather than a self-definitional or fitted prediction. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit modeling choices or new postulated quantities are described.

pith-pipeline@v0.9.1-grok · 5765 in / 994 out tokens · 11622 ms · 2026-06-28T01:16:16.172812+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 1 canonical work pages

[1] M. Sombolestan and Q. Nguyen. Adaptive-force-based control of dynamic legged locomotion over uneven terrain. IEEE Transactions on Robotics, 40:2462–2477, 2024.

[2] Q. Ben, F. Jia, J. Zeng, J. Dong, D. Lin, and J. Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit. arXiv preprint arXiv:2502.13013, 2025.

[3] Y. Zhang, Y. Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive humanoid loco-manipulation. arXiv preprint arXiv:2505.06776, 2025.

[4] N. Fey, G. B. Margolis, M. Peticco, and P. Agrawal. Bridging the sim-to-real gap for athletic loco-manipulation. arXiv preprint arXiv:2502.10894, 2025.

[5] M. Murooka, K. Chappellet, A. Tanguy, M. Benallegue, I. Kumagai, M. Morisawa, F. Kanehiro, and A. Kheddar. Humanoid loco-manipulations pattern generation and stabilization control. IEEE Robotics and Automation Letters, 6(3):5597–5604, 2021.

[6] K. Bouyarmane, K. Chappellet, J. Vaillant, and A. Kheddar. Quadratic programming for multirobot and task-space force control. IEEE Transactions on Robotics, 35(1):64–77, 2018.

[7] Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv preprint arXiv:2406.10454, 2024.

[8] J. He, C. Zhang, F. Jenelten, R. Grandia, M. Bächer, and M. Hutter. Attention-based map encoding for learning generalized legged locomotion. Science Robotics, 10(105):eadv3604, 2025.

[9] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022.

[10] J. Long, J. Ren, M. Shi, Z. Wang, T. Huang, P. Luo, and J. Pang. Learning humanoid locomotion with perceptive internal model. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9997–10003. IEEE, 2025.

[11] D. Hoeller, N. Rudin, D. Sako, and M. Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots. Science Robotics, 9(88):eadi7566, 2024.

[12] X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450. IEEE, 2024.

[13] A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on robot learning, pages 403–415. PMLR, 2023.

[14] H. Song, H. Zhu, T. Yu, Y. Liu, M. Yuan, W. Zhou, H. Chen, and H. Li. Gait-adaptive perceptive humanoid locomotion with real-time under-base terrain reconstruction. IEEE Robotics and Automation Letters, 2026.

[15] R. Yang, M. Zhang, N. Hansen, H. Xu, and X. Wang. Learning vision-guided quadrupedal locomotion end-to-end with cross-modal transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=kFdPX1VdgXx

[16] Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. In 8th Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=fs7ia3FqUM

[17] C. Zhang, V. Klemm, F. Yang, and M. Hutter. Ame-2: Agile and generalized legged locomotion via attention-based neural map encoding. arXiv preprint arXiv:2601.08485, 2026.

[18] P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, and M. Hutter. Robust rough-terrain locomotion with a quadrupedal robot. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5761–5768. IEEE, 2018.

[19] F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, and M. Hutter. Perceptive locomotion in rough terrain–online foothold optimization. IEEE Robotics and Automation Letters, 5(4):5370–5376, 2020.

[20] Z. Wang, Y. Li, L. Xu, H. Shi, Z. Ma, Z. Chu, C. Li, F. Gao, K. Yang, and K. Wang. Sf-tim: A simple framework for enhancing quadrupedal robot jumping agility by combining terrain imagination and measurement. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10676–10683. IEEE, 2025.

[21] T. Miki, L. Wellhausen, R. Grandia, F. Jenelten, T. Homberger, and M. Hutter. Elevation mapping for locomotion and navigation using gpu. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2273–2280. IEEE, 2022.

[22] Y. Dong, J. Ma, L. Zhao, W. Li, and P. Lu. Marg: Mastering risky gap terrains for legged robots with elevation mapping. IEEE Transactions on Robotics, 2025.

[23] C. Zhang, N. Rudin, D. Hoeller, and M. Hutter. Learning agile locomotion on risky terrains. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11864–11871. IEEE, 2024.

[24] Y. Chen, J. Ma, Z. Luo, Y. Han, Y. Dong, B. Xu, and P. Lu. Learning autonomous and safe quadruped traversal of complex terrains using multi-layer elevation maps. IEEE Robotics and Automation Letters, 2025.

[25] T. Miki, J. Lee, L. Wellhausen, and M. Hutter. Learning to walk in confined spaces using 3d representation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8649–8656. IEEE, 2024.

[26] P. Fankhauser, M. Bloesch, and M. Hutter. Probabilistic terrain mapping for mobile robots with uncertain localization. IEEE Robotics and Automation Letters, 3(4):3019–3026, 2018.

[27] W. Yu, D. Jain, A. Escontrela, A. Iscen, P. Xu, E. Coumans, S. Ha, J. Tan, and T. Zhang. Visual-locomotion: Learning to walk on complex terrains with vision. In 5th Annual Conference on Robot Learning, 2021.

[28] H. Duan, B. Pandit, M. S. Gadde, B. Van Marum, J. Dao, C. Kim, and A. Fern. Learning vision-based bipedal locomotion for challenging terrain. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 56–62. IEEE, 2024.

[29] H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. Beamdojo: Learning agile humanoid locomotion on sparse footholds. arXiv preprint arXiv:2502.10363, 2025.

[30] N. Rudin, J. He, J. Aurand, and M. Hutter. Parkour in the wild: Learning a general and extensible agile locomotion policy using multi-expert distillation and rl fine-tuning. arXiv preprint arXiv:2505.11164, 2025.

[31] J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y. Guo, and Q. Zhang. Dpl: Depth-only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction. arXiv preprint arXiv:2510.07152, 2025.

[32] Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: Voxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains. arXiv preprint arXiv:2511.14625, 2025.

[33] S. Li, S. Luo, J. Wu, and Q. Zhu. Move: Multi-skill omnidirectional legged locomotion with limited view in 3d environments. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7647–7653. IEEE, 2025.

[34] P. Li, H. Li, Y. Ma, L. Chang, X. Yang, R. Yu, Y. Zhang, Y. Cao, Q. Zhu, and G. Sartoretti. Kivi: Kinesthetic-visuospatial integration for dynamic and safe egocentric legged locomotion. arXiv preprint arXiv:2509.23650, 2025.

[35] S. Luo, S. Li, R. Yu, Z. Wang, J. Wu, and Q. Zhu. Pie: Parkour with implicit-explicit learning framework for legged robots. IEEE Robotics and Automation Letters, 9(11):9986–9993, 2024.

[36] R. Yang, G. Yang, and X. Wang. Neural volumetric memory for visual locomotion control. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2023.

[37] H. Lai, J. Cao, J. Xu, H. Wu, Y. Lin, T. Kong, Y. Yu, and W. Zhang. World model-based perception for visual legged locomotion. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11531–11537. IEEE, 2025.

[38] C. Zhang, J. Jin, J. Frey, N. Rudin, M. Mattamala, C. Cadena, and M. Hutter. Resilient legged local navigation: Learning to traverse with compromised perception end-to-end. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 34–41. IEEE, 2024.

[39] D. Hoeller, N. Rudin, C. Choy, A. Anandkumar, and M. Hutter. Neural scene representation for locomotion on structured terrain. IEEE Robotics and Automation Letters, 7(4):8667–8674, 2022.

[40] R. Yu, Q. Wang, H. Li, Z. Jun, Z. Wang, J. Wu, and Q. Zhu. Start: Traversing sparse footholds with terrain reconstruction. IEEE Robotics and Automation Letters, 11(2):2194–2201, 2025.

[41] F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter. Spatially-enhanced recurrent memory for long-range mapless navigation via end-to-end reinforcement learning. The International Journal of Robotics Research, page 02783649251401926, 2025.

[42] A. Reed, B. Crowe, D. Albin, L. Achey, B. Hayes, and C. Heckman. Scenesense: Diffusion models for 3d occupancy synthesis from partial observation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7383–7390. IEEE, 2024.

[43] S. Shao, T. Huang, W. Gao, and S. Zhang. Adapt: Adaptive dual-projection architecture for perceptive traversal. arXiv preprint arXiv:2603.16328, 2026.

[44] K. Singh, Y. Kim, Y. Turkar, and K. Dantu. Cart: Context-aware terrain adaptation using temporal sequence selection for legged robots. arXiv preprint arXiv:2604.14344, 2026.

[45] S. Ma, H. Chen, Z. Xu, Y. Zhao, K. Wu, R. Yang, L. Zou, Z. Gan, and W. Ding. Cmoe: Contrastive mixture of experts for motion control and terrain adaptation of humanoid robots. arXiv preprint arXiv:2603.03067, 2026.

[46] C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research. arXiv preprint arXiv:2509.10771, 2025.

[47] M. Mittal, N. Rudin, V. Klemm, A. Allshire, and M. Hutter. Symmetry considerations for learning task symmetric robot policies. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024.

[48] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4):1–20. ISSN 1557-7368. doi:10.1145/3450626.3459670. URL http://dx.doi.org/10.1145/3450626.3459670

[50] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y. Feng, A. Garg, R. Gasoto, L. Gulich, Y. Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V. Makoviychuk, G. Malczyk, H...

[51] Y. Hao, R. Yu, S. Luo, G. Zhang, J. Wu, and Q. Zhu. Cref: Cross-modal and recurrent fusion for depth-conditioned humanoid locomotion. arXiv preprint arXiv:2603.29452, 2026.

[52] W. Sun, Y. Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, B. Cao, Y. Liu, et al. Now you see that: Learning end-to-end humanoid locomotion from raw pixels. arXiv preprint arXiv:2602.06382, 2026.

[53] D. Wang, X. Wang, X. Liu, J. Shi, Y. Zhao, C. Bai, and X. Li. More: Mixture of residual experts for humanoid lifelike gaits learning on complex terrains. arXiv preprint arXiv:2506.08840, 2025.

[54] S. Zhu, Z. Zhuang, M. Zhao, K.-Y. Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids. arXiv preprint arXiv:2601.07718, 2026.

[55] Y. Zhang, Y. Seo, J. Chen, Y. Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains. arXiv preprint arXiv:2602.03002, 2026.

[56] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, Oct. 2019.

A. The Details of POMDP. We formulate humanoid perceptive locomotion as a partially observable Markov decision process (POMDP), denoted as a 6-tuple M = ⟨S, O, A, P, R...

[1] [1]

Sombolestan and Q

M. Sombolestan and Q. Nguyen. Adaptive-force-based control of dynamic legged locomotion over uneven terrain.IEEE Transactions on Robotics, 40:2462–2477, 2024

2024

[2] [2]

Q. Ben, F. Jia, J. Zeng, J. Dong, D. Lin, and J. Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

arXiv 2025

[3] [3]

Zhang, Y

Y . Zhang, Y . Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive hu- manoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

arXiv 2025

[4] [4]

N. Fey, G. B. Margolis, M. Peticco, and P. Agrawal. Bridging the sim-to-real gap for athletic loco-manipulation.arXiv preprint arXiv:2502.10894, 2025

arXiv 2025

[5] [5]

Murooka, K

M. Murooka, K. Chappellet, A. Tanguy, M. Benallegue, I. Kumagai, M. Morisawa, F. Kane- hiro, and A. Kheddar. Humanoid loco-manipulations pattern generation and stabilization con- trol.IEEE Robotics and Automation Letters, 6(3):5597–5604, 2021

2021

[6] [6]

Bouyarmane, K

K. Bouyarmane, K. Chappellet, J. Vaillant, and A. Kheddar. Quadratic programming for mul- tirobot and task-space force control.IEEE Transactions on Robotics, 35(1):64–77, 2018

2018

[7] [7]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

arXiv 2024

[8] [8]

J. He, C. Zhang, F. Jenelten, R. Grandia, M. B ¨acher, and M. Hutter. Attention-based map encoding for learning generalized legged locomotion.Science Robotics, 10(105):eadv3604, 2025

2025

[9] [9]

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning robust per- ceptive locomotion for quadrupedal robots in the wild.Science Robotics, 7(62):eabk2822, 2022

2022

[10] [10]

J. Long, J. Ren, M. Shi, Z. Wang, T. Huang, P. Luo, and J. Pang. Learning humanoid locomo- tion with perceptive internal model. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9997–10003. IEEE, 2025

2025

[11] [11]

Hoeller, N

D. Hoeller, N. Rudin, D. Sako, and M. Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024

2024

[12] [12]

Cheng, K

X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450. IEEE, 2024

2024

[13] [13]

Agarwal, A

A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. InConference on robot learning, pages 403–415. PMLR, 2023

2023

[14] [14]

H. Song, H. Zhu, T. Yu, Y . Liu, M. Yuan, W. Zhou, H. Chen, and H. Li. Gait-adaptive per- ceptive humanoid locomotion with real-time under-base terrain reconstruction.IEEE Robotics and Automation Letters, 2026

2026

[15] [15]

R. Yang, M. Zhang, N. Hansen, H. Xu, and X. Wang. Learning vision-guided quadrupedal lo- comotion end-to-end with cross-modal transformers. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=kFdPX1VdgXx

2022

[16] [16]

Zhuang, S

Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. In8th Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=fs7ia3FqUM

2024

[17] [17]

Zhang, V

C. Zhang, V . Klemm, F. Yang, and M. Hutter. Ame-2: Agile and generalized legged locomotion via attention-based neural map encoding.arXiv preprint arXiv:2601.08485, 2026. 9

arXiv 2026

[18]

P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, and M. Hutter. Robust rough-terrain locomotion with a quadrupedal robot. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5761–5768. IEEE, 2018.

[19]

F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, and M. Hutter. Perceptive locomotion in rough terrain–online foothold optimization. IEEE Robotics and Automation Letters, 5(4):5370–5376, 2020.

[20]

Z. Wang, Y. Li, L. Xu, H. Shi, Z. Ma, Z. Chu, C. Li, F. Gao, K. Yang, and K. Wang. Sf-tim: A simple framework for enhancing quadrupedal robot jumping agility by combining terrain imagination and measurement. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10676–10683. IEEE, 2025.

[21]

T. Miki, L. Wellhausen, R. Grandia, F. Jenelten, T. Homberger, and M. Hutter. Elevation mapping for locomotion and navigation using gpu. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2273–2280. IEEE, 2022.

[22]

Y. Dong, J. Ma, L. Zhao, W. Li, and P. Lu. Marg: Mastering risky gap terrains for legged robots with elevation mapping. IEEE Transactions on Robotics, 2025.

[23]

C. Zhang, N. Rudin, D. Hoeller, and M. Hutter. Learning agile locomotion on risky terrains. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11864–11871. IEEE, 2024.

[24]

Y. Chen, J. Ma, Z. Luo, Y. Han, Y. Dong, B. Xu, and P. Lu. Learning autonomous and safe quadruped traversal of complex terrains using multi-layer elevation maps. IEEE Robotics and Automation Letters, 2025.

[25]

T. Miki, J. Lee, L. Wellhausen, and M. Hutter. Learning to walk in confined spaces using 3d representation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8649–8656. IEEE, 2024.

[26]

P. Fankhauser, M. Bloesch, and M. Hutter. Probabilistic terrain mapping for mobile robots with uncertain localization. IEEE Robotics and Automation Letters, 3(4):3019–3026, 2018.

[27]

W. Yu, D. Jain, A. Escontrela, A. Iscen, P. Xu, E. Coumans, S. Ha, J. Tan, and T. Zhang. Visual-locomotion: Learning to walk on complex terrains with vision. In 5th Annual Conference on Robot Learning, 2021.

[28]

H. Duan, B. Pandit, M. S. Gadde, B. Van Marum, J. Dao, C. Kim, and A. Fern. Learning vision-based bipedal locomotion for challenging terrain. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 56–62. IEEE, 2024.

[29]

H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. Beamdojo: Learning agile humanoid locomotion on sparse footholds. arXiv preprint arXiv:2502.10363, 2025.

[30]

N. Rudin, J. He, J. Aurand, and M. Hutter. Parkour in the wild: Learning a general and extensible agile locomotion policy using multi-expert distillation and rl fine-tuning. arXiv preprint arXiv:2505.11164, 2025.

[31]

J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y. Guo, and Q. Zhang. Dpl: Depth-only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction. arXiv preprint arXiv:2510.07152, 2025.

[32]

Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: Voxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains. arXiv preprint arXiv:2511.14625, 2025.

[33]

S. Li, S. Luo, J. Wu, and Q. Zhu. Move: Multi-skill omnidirectional legged locomotion with limited view in 3d environments. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7647–7653. IEEE, 2025.

[34]

P. Li, H. Li, Y. Ma, L. Chang, X. Yang, R. Yu, Y. Zhang, Y. Cao, Q. Zhu, and G. Sartoretti. Kivi: Kinesthetic-visuospatial integration for dynamic and safe egocentric legged locomotion. arXiv preprint arXiv:2509.23650, 2025.

[35]

S. Luo, S. Li, R. Yu, Z. Wang, J. Wu, and Q. Zhu. Pie: Parkour with implicit-explicit learning framework for legged robots. IEEE Robotics and Automation Letters, 9(11):9986–9993, 2024.

[36]

R. Yang, G. Yang, and X. Wang. Neural volumetric memory for visual locomotion control. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2023.

[37]

H. Lai, J. Cao, J. Xu, H. Wu, Y. Lin, T. Kong, Y. Yu, and W. Zhang. World model-based perception for visual legged locomotion. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11531–11537. IEEE, 2025.

[38]

C. Zhang, J. Jin, J. Frey, N. Rudin, M. Mattamala, C. Cadena, and M. Hutter. Resilient legged local navigation: Learning to traverse with compromised perception end-to-end. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 34–41. IEEE, 2024.

[39]

D. Hoeller, N. Rudin, C. Choy, A. Anandkumar, and M. Hutter. Neural scene representation for locomotion on structured terrain. IEEE Robotics and Automation Letters, 7(4):8667–8674, 2022.

[40]

R. Yu, Q. Wang, H. Li, Z. Jun, Z. Wang, J. Wu, and Q. Zhu. Start: Traversing sparse footholds with terrain reconstruction. IEEE Robotics and Automation Letters, 11(2):2194–2201, 2025.

[41]

F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter. Spatially-enhanced recurrent memory for long-range mapless navigation via end-to-end reinforcement learning. The International Journal of Robotics Research, page 02783649251401926, 2025.

[42]

A. Reed, B. Crowe, D. Albin, L. Achey, B. Hayes, and C. Heckman. Scenesense: Diffusion models for 3d occupancy synthesis from partial observation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7383–7390. IEEE, 2024.

[43]

S. Shao, T. Huang, W. Gao, and S. Zhang. Adapt: Adaptive dual-projection architecture for perceptive traversal. arXiv preprint arXiv:2603.16328, 2026.

[44]

K. Singh, Y. Kim, Y. Turkar, and K. Dantu. Cart: Context-aware terrain adaptation using temporal sequence selection for legged robots. arXiv preprint arXiv:2604.14344, 2026.

[45]

S. Ma, H. Chen, Z. Xu, Y. Zhao, K. Wu, R. Yang, L. Zou, Z. Gan, and W. Ding. Cmoe: Contrastive mixture of experts for motion control and terrain adaptation of humanoid robots. arXiv preprint arXiv:2603.03067, 2026.

[46]

C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research. arXiv preprint arXiv:2509.10771, 2025.

[47]

M. Mittal, N. Rudin, V. Klemm, A. Allshire, and M. Hutter. Symmetry considerations for learning task symmetric robot policies. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024.

[48]

X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4):1–20,

[49]

ISSN 1557-7368. doi:10.1145/3450626.3459670. URL http://dx.doi.org/10.1145/3450626.3459670.

[50]

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y. Feng, A. Garg, R. Gasoto, L. Gulich, Y. Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V. Makoviychuk, G. Malczyk, H...

[51]

Y. Hao, R. Yu, S. Luo, G. Zhang, J. Wu, and Q. Zhu. Cref: Cross-modal and recurrent fusion for depth-conditioned humanoid locomotion. arXiv preprint arXiv:2603.29452, 2026.

[52]

W. Sun, Y. Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, B. Cao, Y. Liu, et al. Now you see that: Learning end-to-end humanoid locomotion from raw pixels. arXiv preprint arXiv:2602.06382, 2026.

[53]

D. Wang, X. Wang, X. Liu, J. Shi, Y. Zhao, C. Bai, and X. Li. More: Mixture of residual experts for humanoid lifelike gaits learning on complex terrains. arXiv preprint arXiv:2506.08840, 2025.

[54]

S. Zhu, Z. Zhuang, M. Zhao, K.-Y. Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids. arXiv preprint arXiv:2601.07718, 2026.

[55]

Y. Zhang, Y. Seo, J. Chen, Y. Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains. arXiv preprint arXiv:2602.03002, 2026.

[56]

N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, Oct. 2019.

A The Details of POMDP

We formulate humanoid perceptive locomotion as a partially observable Markov decision process (POMDP), denoted as a 6-tuple M=⟨S,O,A,P,R...