TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion
Pith reviewed 2026-06-28 01:16 UTC · model grok-4.3
The pith
A terrain-aware active gaze framework lets reinforcement learning produce selective attention that enables 1.2 meter real-world gap crossings for humanoid robots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fusing vision, proprioception, and motion commands, the attention-based controller learns to attend to specific regions of the height scan and uses those regions for downstream action selection. Gaze behaviors emerge through reinforcement learning without additional supervision or explicit guidance, raising observation efficiency and allowing fine-grained perceptive locomotion across larger terrains. The resulting policy produces robust simulation-to-hardware transfer that includes terrain-aware foothold selection, elevated-platform crossing, competitive sparse-foothold performance, and the largest reported real-world gap traversal of 1.2 meters while remaining stable under perceptual dis
What carries the argument
The Terrain-aware Active Gaze (TAGA) attention module that adaptively selects informative height-scan patches before they reach the policy network.
If this is right
- The policy performs reliable terrain-aware foothold selection across varied surfaces.
- It traverses elevated platforms without loss of balance.
- It matches or exceeds prior methods on sparse-foothold sequences.
- It achieves 1.2 meter gap crossings in hardware, the largest reported for perceptive humanoids.
- It keeps balance under severe perceptual noise and environmental interference.
Where Pith is reading between the lines
- The same attention emergence might reduce the need for separate perception pipelines in other sensor-rich control tasks.
- Policies trained this way could generalize to new robot morphologies with only retraining of the attention weights.
- If the pattern holds, future work could test whether similar unsupervised attention appears in non-locomotion settings such as arm manipulation over cluttered tables.
Load-bearing premise
Gaze behaviors will emerge from ordinary reinforcement learning without any extra supervision and that this emergence is what produces the reported hardware performance gains.
What would settle it
An ablation that removes the learned attention module or forces uniform sampling of the height scan and then measures whether the policy still reaches 1.2 meter real-world gap crossings with comparable stability.
Figures
read the original abstract
Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, significantly improve training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TAGA, a Terrain-aware Active Gaze learning framework for attention-based humanoid locomotion. By fusing vision, proprioception, and motion commands, the framework is claimed to enable the model to learn anticipatory cues and actively attend to informative regions of the height scan. The authors state that such gaze behaviors emerge naturally through reinforcement learning without additional supervision or explicit guidance, improving training efficiency and yielding robust, generalizable policies. These policies are reported to achieve reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold performance, and the largest reported real-world gap traversal of 1.2 m among perceptive humanoid systems, while remaining stable under perceptual disturbances.
Significance. If the central claims hold after addressing the noted gaps, the work would be significant for perceptive humanoid locomotion by providing evidence that active gaze can emerge unsupervised in RL pipelines, potentially improving observation efficiency under onboard compute limits and enabling larger-scale terrain traversal. The reported hardware gap distance would represent a concrete benchmark advance if supported by detailed, reproducible methods and controls.
major comments (2)
- [Abstract and §4] Abstract and §4 (Results): The headline claims that gaze behaviors 'naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance' and directly cause the reported training-efficiency gains and hardware robustness (including 1.2 m gap traversal) are load-bearing for the central contribution, yet no ablation studies are described that disable or randomize the attention-guidance component while holding the rest of the RL pipeline fixed. Without such controls, it is impossible to isolate whether performance stems from emergent gaze or from other elements such as proprioceptive fusion, reward shaping, or curriculum design.
- [Abstract] Abstract: All quantitative performance claims (terrain-aware footholds, 1.2 m gap traversal, stability under perceptual noise) are stated without accompanying metrics, error bars, training curves, or ablation tables. This absence prevents assessment of effect sizes and statistical reliability of the sim-to-real transfer attributed to the gaze mechanism.
minor comments (1)
- [Abstract] The abstract refers to 'the largest reported real-world gap traversal distance of 1.2m among perceptive humanoid locomotion systems' without citing the specific prior works used for comparison; a table or explicit references in the main text would clarify the baseline.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the evidence for our claims.
read point-by-point responses
-
Referee: The headline claims that gaze behaviors 'naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance' and directly cause the reported training-efficiency gains and hardware robustness (including 1.2 m gap traversal) are load-bearing for the central contribution, yet no ablation studies are described that disable or randomize the attention-guidance component while holding the rest of the RL pipeline fixed. Without such controls, it is impossible to isolate whether performance stems from emergent gaze or from other elements such as proprioceptive fusion, reward shaping, or curriculum design.
Authors: We agree that dedicated ablation studies isolating the attention-guidance component are needed to rigorously support the emergence claim. In the revised manuscript, we will add experiments that disable or randomize the active gaze selection while holding the RL pipeline, rewards, proprioception, and curriculum fixed, to quantify its specific contribution to training efficiency and robustness. revision: yes
-
Referee: All quantitative performance claims (terrain-aware footholds, 1.2 m gap traversal, stability under perceptual noise) are stated without accompanying metrics, error bars, training curves, or ablation tables. This absence prevents assessment of effect sizes and statistical reliability of the sim-to-real transfer attributed to the gaze mechanism.
Authors: We will revise the abstract to report key quantitative metrics with error bars and statistical details. We will also expand §4 to include comprehensive training curves, ablation tables, and effect-size metrics supporting the performance claims and sim-to-real transfer. revision: yes
Circularity Check
No circularity; claims rest on empirical RL training outcomes
full rationale
The paper presents its core results as outcomes of reinforcement learning training on a terrain-aware active gaze framework. No equations, fitted parameters, or self-citations are shown in the provided text that reduce the reported performance gains (e.g., 1.2m gap traversal) to definitional inputs or prior self-referential results by construction. The statement that gaze behaviors 'naturally emerge through reinforcement learning alone' is framed as an empirical observation rather than a self-definitional or fitted prediction. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Sombolestan and Q
M. Sombolestan and Q. Nguyen. Adaptive-force-based control of dynamic legged locomotion over uneven terrain.IEEE Transactions on Robotics, 40:2462–2477, 2024
2024
-
[2]
Q. Ben, F. Jia, J. Zeng, J. Dong, D. Lin, and J. Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025
arXiv 2025
- [3]
-
[4]
N. Fey, G. B. Margolis, M. Peticco, and P. Agrawal. Bridging the sim-to-real gap for athletic loco-manipulation.arXiv preprint arXiv:2502.10894, 2025
arXiv 2025
-
[5]
Murooka, K
M. Murooka, K. Chappellet, A. Tanguy, M. Benallegue, I. Kumagai, M. Morisawa, F. Kane- hiro, and A. Kheddar. Humanoid loco-manipulations pattern generation and stabilization con- trol.IEEE Robotics and Automation Letters, 6(3):5597–5604, 2021
2021
-
[6]
Bouyarmane, K
K. Bouyarmane, K. Chappellet, J. Vaillant, and A. Kheddar. Quadratic programming for mul- tirobot and task-space force control.IEEE Transactions on Robotics, 35(1):64–77, 2018
2018
-
[7]
Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024
arXiv 2024
-
[8]
J. He, C. Zhang, F. Jenelten, R. Grandia, M. B ¨acher, and M. Hutter. Attention-based map encoding for learning generalized legged locomotion.Science Robotics, 10(105):eadv3604, 2025
2025
-
[9]
T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning robust per- ceptive locomotion for quadrupedal robots in the wild.Science Robotics, 7(62):eabk2822, 2022
2022
-
[10]
J. Long, J. Ren, M. Shi, Z. Wang, T. Huang, P. Luo, and J. Pang. Learning humanoid locomo- tion with perceptive internal model. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9997–10003. IEEE, 2025
2025
-
[11]
Hoeller, N
D. Hoeller, N. Rudin, D. Sako, and M. Hutter. Anymal parkour: Learning agile navigation for quadrupedal robots.Science Robotics, 9(88):eadi7566, 2024
2024
-
[12]
Cheng, K
X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450. IEEE, 2024
2024
-
[13]
Agarwal, A
A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. InConference on robot learning, pages 403–415. PMLR, 2023
2023
-
[14]
H. Song, H. Zhu, T. Yu, Y . Liu, M. Yuan, W. Zhou, H. Chen, and H. Li. Gait-adaptive per- ceptive humanoid locomotion with real-time under-base terrain reconstruction.IEEE Robotics and Automation Letters, 2026
2026
-
[15]
R. Yang, M. Zhang, N. Hansen, H. Xu, and X. Wang. Learning vision-guided quadrupedal lo- comotion end-to-end with cross-modal transformers. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=kFdPX1VdgXx
2022
-
[16]
Zhuang, S
Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. In8th Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=fs7ia3FqUM
2024
- [17]
-
[18]
Fankhauser, M
P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, and M. Hutter. Robust rough-terrain locomotion with a quadrupedal robot. In2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5761–5768. IEEE, 2018
2018
-
[19]
Jenelten, T
F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, and M. Hutter. Perceptive locomotion in rough terrain–online foothold optimization.IEEE Robotics and Automation Letters, 5(4):5370–5376, 2020
2020
-
[20]
Z. Wang, Y . Li, L. Xu, H. Shi, Z. Ma, Z. Chu, C. Li, F. Gao, K. Yang, and K. Wang. Sf-tim: A simple framework for enhancing quadrupedal robot jumping agility by combining terrain imagination and measurement. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10676–10683. IEEE, 2025
2025
-
[21]
T. Miki, L. Wellhausen, R. Grandia, F. Jenelten, T. Homberger, and M. Hutter. Elevation map- ping for locomotion and navigation using gpu. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2273–2280. IEEE, 2022
2022
-
[22]
Y . Dong, J. Ma, L. Zhao, W. Li, and P. Lu. Marg: Mastering risky gap terrains for legged robots with elevation mapping.IEEE Transactions on Robotics, 2025
2025
-
[23]
Zhang, N
C. Zhang, N. Rudin, D. Hoeller, and M. Hutter. Learning agile locomotion on risky terrains. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11864–11871. IEEE, 2024
2024
-
[24]
Y . Chen, J. Ma, Z. Luo, Y . Han, Y . Dong, B. Xu, and P. Lu. Learning autonomous and safe quadruped traversal of complex terrains using multi-layer elevation maps.IEEE Robotics and Automation Letters, 2025
2025
-
[25]
T. Miki, J. Lee, L. Wellhausen, and M. Hutter. Learning to walk in confined spaces using 3d representation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8649–8656. IEEE, 2024
2024
-
[26]
Fankhauser, M
P. Fankhauser, M. Bloesch, and M. Hutter. Probabilistic terrain mapping for mobile robots with uncertain localization.IEEE Robotics and Automation Letters, 3(4):3019–3026, 2018
2018
-
[27]
W. Yu, D. Jain, A. Escontrela, A. Iscen, P. Xu, E. Coumans, S. Ha, J. Tan, and T. Zhang. Visual- locomotion: Learning to walk on complex terrains with vision. In5th Annual Conference on Robot Learning, 2021
2021
-
[28]
H. Duan, B. Pandit, M. S. Gadde, B. Van Marum, J. Dao, C. Kim, and A. Fern. Learning vision- based bipedal locomotion for challenging terrain. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 56–62. IEEE, 2024
2024
-
[29]
H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. Beamdojo: Learning agile humanoid locomotion on sparse footholds.arXiv preprint arXiv:2502.10363, 2025
arXiv 2025
- [30]
-
[31]
J. Sun, G. Han, P. Sun, W. Zhao, J. Cao, J. Wang, Y . Guo, and Q. Zhang. Dpl: Depth- only perceptive humanoid locomotion via realistic depth synthesis and cross-attention terrain reconstruction.arXiv preprint arXiv:2510.07152, 2025
arXiv 2025
-
[32]
Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: V oxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains.arXiv preprint arXiv:2511.14625, 2025. 10
arXiv 2025
-
[33]
S. Li, S. Luo, J. Wu, and Q. Zhu. Move: Multi-skill omnidirectional legged locomotion with limited view in 3d environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7647–7653. IEEE, 2025
2025
-
[34]
P. Li, H. Li, Y . Ma, L. Chang, X. Yang, R. Yu, Y . Zhang, Y . Cao, Q. Zhu, and G. Sartoretti. Kivi: Kinesthetic-visuospatial integration for dynamic and safe egocentric legged locomotion. arXiv preprint arXiv:2509.23650, 2025
Pith/arXiv arXiv 2025
-
[35]
S. Luo, S. Li, R. Yu, Z. Wang, J. Wu, and Q. Zhu. Pie: Parkour with implicit-explicit learning framework for legged robots.IEEE Robotics and Automation Letters, 9(11):9986–9993, 2024
2024
-
[36]
R. Yang, G. Yang, and X. Wang. Neural volumetric memory for visual locomotion control. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2023
2023
-
[37]
H. Lai, J. Cao, J. Xu, H. Wu, Y . Lin, T. Kong, Y . Yu, and W. Zhang. World model-based perception for visual legged locomotion. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11531–11537. IEEE, 2025
2025
-
[38]
Zhang, J
C. Zhang, J. Jin, J. Frey, N. Rudin, M. Mattamala, C. Cadena, and M. Hutter. Resilient legged local navigation: Learning to traverse with compromised perception end-to-end. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 34–41. IEEE, 2024
2024
-
[39]
Hoeller, N
D. Hoeller, N. Rudin, C. Choy, A. Anandkumar, and M. Hutter. Neural scene representation for locomotion on structured terrain.IEEE Robotics and Automation Letters, 7(4):8667–8674, 2022
2022
-
[40]
R. Yu, Q. Wang, H. Li, Z. Jun, Z. Wang, J. Wu, and Q. Zhu. Start: Traversing sparse footholds with terrain reconstruction.IEEE Robotics and Automation Letters, 11(2):2194–2201, 2025
2025
-
[41]
F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter. Spatially-enhanced recur- rent memory for long-range mapless navigation via end-to-end reinforcement learning.The International Journal of Robotics Research, page 02783649251401926, 2025
2025
-
[42]
A. Reed, B. Crowe, D. Albin, L. Achey, B. Hayes, and C. Heckman. Scenesense: Diffusion models for 3d occupancy synthesis from partial observation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7383–7390. IEEE, 2024
2024
-
[43]
S. Shao, T. Huang, W. Gao, and S. Zhang. Adapt: Adaptive dual-projection architecture for perceptive traversal.arXiv preprint arXiv:2603.16328, 2026
arXiv 2026
-
[44]
K. Singh, Y . Kim, Y . Turkar, and K. Dantu. Cart: Context-aware terrain adaptation using temporal sequence selection for legged robots.arXiv preprint arXiv:2604.14344, 2026
Pith/arXiv arXiv 2026
-
[45]
S. Ma, H. Chen, Z. Xu, Y . Zhao, K. Wu, R. Yang, L. Zou, Z. Gan, and W. Ding. Cmoe: Contrastive mixture of experts for motion control and terrain adaptation of humanoid robots. arXiv preprint arXiv:2603.03067, 2026
arXiv 2026
-
[46]
C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025
arXiv 2025
-
[47]
Mittal, N
M. Mittal, N. Rudin, V . Klemm, A. Allshire, and M. Hutter. Symmetry considerations for learning task symmetric robot policies. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7439. IEEE, 2024
2024
-
[48]
X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: adversarial motion priors for stylized physics-based character control.ACM Transactions on Graphics, 40(4):1–20,
-
[49]
ISSN 1557-7368. doi:10.1145/3450626.3459670. URLhttp://dx.doi.org/10. 1145/3450626.3459670. 11
-
[50]
M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Zurbr ¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, H...
Pith/arXiv arXiv 2025
-
[51]
Y . Hao, R. Yu, S. Luo, G. Zhang, J. Wu, and Q. Zhu. Cref: Cross-modal and recurrent fusion for depth-conditioned humanoid locomotion.arXiv preprint arXiv:2603.29452, 2026
arXiv 2026
-
[52]
W. Sun, Y . Su, L. Huang, A. Zhang, D. Wei, M. San, D. Tian, E. Cao, B. Cao, Y . Liu, et al. Now you see that: Learning end-to-end humanoid locomotion from raw pixels.arXiv preprint arXiv:2602.06382, 2026
Pith/arXiv arXiv 2026
-
[53]
D. Wang, X. Wang, X. Liu, J. Shi, Y . Zhao, C. Bai, and X. Li. More: Mixture of residual experts for humanoid lifelike gaits learning on complex terrains.arXiv preprint arXiv:2506.08840, 2025
arXiv 2025
-
[54]
S. Zhu, Z. Zhuang, M. Zhao, K.-Y . Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids.arXiv preprint arXiv:2601.07718, 2026
arXiv 2026
- [55]
-
[56]
Mahmood, N
N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Conference on Computer Vision, pages 5442–5451, Oct. 2019. 12 A The Details of POMDP We formulate humanoid perceptive locomotion as a partially observable Markov decision process (POMDP), denoted as a 6-tupleM=⟨S,O,A,P,R...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.