
arxiv: 2505.21282 · v2 · submitted 2025-05-27 · 💻 cs.RO

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

Pith reviewed 2026-05-19 12:53 UTC · model grok-4.3

classification 💻 cs.RO
keywords multimodal dataset · robot navigation · human navigation · imitation learning · traversability segmentation · real-world data · data collection pipelines

The pith

EgoWalk supplies 50 hours of multimodal human navigation recordings across diverse indoor, outdoor, and seasonal environments to train robot navigation systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoWalk, a dataset of 50 hours of human navigation through varied real-world settings that include different seasons and locations. It supplies raw multimodal sensor streams ready for imitation learning plus automated pipelines that generate natural language goal annotations and traversability segmentation masks. Diversity analyses, use cases, and benchmarks demonstrate how the data supports training of navigation algorithms that must operate in uncontrolled conditions. The authors release the processing pipelines and hardware platform details to enable wider use.

Core claim

We introduce EgoWalk - a dataset of 50 hours of human navigation in a diverse set of indoor/outdoor, varied seasons, and location environments. Along with the raw and Imitation Learning-ready data, we introduce several pipelines to automatically create subsidiary datasets for other navigation-related tasks, namely natural language goal annotations and traversability segmentation masks. Diversity studies, use cases, and benchmarks for the proposed dataset are provided to demonstrate its practical applicability.

What carries the argument

EgoWalk multimodal dataset of human navigation trajectories together with automated pipelines that produce natural language goal annotations and traversability segmentation masks.

If this is right

  • Navigation policies trained on EgoWalk data should generalize more reliably to messy real-world settings.
  • Automated annotation pipelines lower the manual effort needed to build task-specific navigation datasets.
  • Benchmarks establish baseline performance for imitation learning and traversability estimation using the data.
  • Open release of collection hardware details and code allows other researchers to replicate or extend the dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing EgoWalk recordings with existing simulation environments could create hybrid training regimes that cover rare edge cases.
  • The same sensor-plus-pipeline approach could be adapted to collect data for related embodied tasks such as object search or manipulation.
  • Over time, repeated use of such datasets may reduce the amount of on-site fine-tuning required when deploying navigation robots in new buildings or cities.

Load-bearing premise

The collected human navigation trajectories and sensor streams are sufficiently representative and high-quality to improve training and robustness of data-driven robot navigation algorithms in uncontrolled real-world conditions.

What would settle it

Train a navigation policy on EgoWalk data and compare its success rate in a new uncontrolled environment against policies trained on existing smaller datasets; if no measurable improvement appears, the utility claim is challenged.
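
A minimal sketch of how that comparison could be scored, assuming each policy is run for a fixed number of episodes in the held-out environment and every episode is marked as a success or a failure. The function name, the choice of a one-sided two-proportion z-test, and any episode counts passed in are illustrative assumptions, not the paper's evaluation protocol.

```python
import math


def success_rate_gap(succ_a, n_a, succ_b, n_b):
    """Compare two policies by episode success counts.

    Policy A is the candidate (e.g. trained on EgoWalk), policy B the baseline.
    Returns (rate_a - rate_b, z statistic, one-sided p-value for rate_a > rate_b).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("each policy needs at least one evaluation episode")

    rate_a = succ_a / n_a
    rate_b = succ_b / n_b

    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (succ_a + succ_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))

    # If every episode succeeded (or failed) for both policies, there is no gap to test.
    z = (rate_a - rate_b) / std_err if std_err > 0 else 0.0

    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return rate_a - rate_b, z, p_value
```

Calling `success_rate_gap(succ_a, n_a, succ_b, n_b)` with the observed counts gives the gap and a p-value; a large p-value would correspond to the "no measurable improvement" outcome described above. With few episodes the normal approximation is rough, so an exact test may be preferable.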

Figures

Figures reproduced from arXiv: 2505.21282 by Alexander Mazurov, Arthur Nigmatzyanov, Denis Fatykhov, Diego Ruiz Salinas, German Devchich, Gonzalo Ferrer, Ivan Sosin, Javier Antonio Ramirez Benavides, Kristina Zipa, Malik Mohrat, Mohamad Al Mdfaa, Pavel Kolesnik, Sergey Bakulin, Timur Akhtyamov.

Figure 1: General overview of the data collection and processing pipelines. Sensor and odometry data are extracted from ...
Figure 2: Diversity of the dataset. Location labels were produced using a vision-language model [21].
Figure 3: Data collection platform: an example of the wearable rig (left) and detailed internal component layout (right).
Figure 4: Overview of the ...
Figure 5: Examples of the auto-generated traversability masks.
Figure 6: Comparison between the different segmentation ...
Figure 7: Qualitative results from language annotations evaluation.
Figure 8: Before Gaussian splatting, the EgoWalk scene is ...
Figure 9: After reconstruction, we can render images ...
Figure 10: Rendered images with artifacts after Gaussian ...
Figure 11: Estimated trajectories for the outdoor and indoor ...

Original abstract

Data-driven navigation algorithms are critically dependent on large-scale, high-quality real-world data collection for successful training and robust performance in realistic and uncontrolled conditions. To enhance the growing family of navigation-related real-world datasets, we introduce EgoWalk - a dataset of 50 hours of human navigation in a diverse set of indoor/outdoor, varied seasons, and location environments. Along with the raw and Imitation Learning-ready data, we introduce several pipelines to automatically create subsidiary datasets for other navigation-related tasks, namely natural language goal annotations and traversability segmentation masks. Diversity studies, use cases, and benchmarks for the proposed dataset are provided to demonstrate its practical applicability. We openly release all data processing pipelines and the description of the hardware platform used for data collection to support future research and development in robot navigation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces EgoWalk, a multimodal dataset of 50 hours of human navigation trajectories collected across diverse indoor/outdoor environments, seasons, and locations. In addition to raw sensor streams and imitation-learning-ready data, it describes automatic pipelines that generate natural language goal annotations and traversability segmentation masks. The authors include diversity analyses, example use cases, and benchmarks, and they release the data-processing pipelines together with a hardware-platform description.

Significance. A well-validated release of this scale and diversity could strengthen data-driven navigation research by supplying real-world multimodal trajectories and auxiliary labels for tasks such as language-conditioned planning and traversability estimation. The open release of pipelines and hardware details is a clear reproducibility asset. The central utility claims, however, rest on the unquantified quality of the automatically generated labels.

major comments (1)
  1. [Section 4] The pipelines that produce natural language goal annotations and traversability segmentation masks are described in detail, yet no quantitative validation (accuracy, precision-recall, or inter-annotator agreement with human labels) is reported. Because the manuscript’s assertions about “practical applicability” and utility for “subsidiary navigation-related tasks” depend on these labels being reliable, the absence of such metrics is load-bearing for the central contribution.
minor comments (2)
  1. [Abstract] The phrase “diversity studies, use cases, and benchmarks” is stated without any concrete metrics or findings; adding one or two headline numbers would improve the abstract’s informativeness.
  2. [Diversity studies] The manuscript would benefit from an explicit statement of the total number of distinct environments and the distribution of hours across indoor versus outdoor settings to allow readers to assess claimed diversity more precisely.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment point by point below.

Point-by-point responses
  1. Referee: [Section 4] The pipelines that produce natural language goal annotations and traversability segmentation masks are described in detail, yet no quantitative validation (accuracy, precision-recall, or inter-annotator agreement with human labels) is reported. Because the manuscript’s assertions about “practical applicability” and utility for “subsidiary navigation-related tasks” depend on these labels being reliable, the absence of such metrics is load-bearing for the central contribution.

    Authors: We appreciate the referee's observation that quantitative validation of the automatic pipelines would strengthen the claims of practical utility. The current manuscript provides detailed descriptions of the pipelines, open-source code for reproducibility, diversity analyses, and example use cases to demonstrate applicability, but we agree that explicit metrics such as accuracy, precision-recall, and agreement with human annotations are important for substantiating reliability. In the revised manuscript we will add a dedicated evaluation subsection that reports these quantitative metrics on a held-out, manually annotated subset of the data for both the language goal annotations and the traversability masks.

    Revision: yes
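
For the traversability masks, the metrics named in this exchange could be computed roughly as follows. This is a minimal sketch, assuming each auto-generated mask and its manual annotation are same-sized 2D grids where a truthy value means "traversable"; the function name and the returned keys are hypothetical and do not come from the released pipelines.

```python
def binary_mask_metrics(pred, truth):
    """Pixel-wise agreement between a predicted and a hand-labelled binary mask.

    Both arguments are sequences of rows; truthy pixels count as traversable.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of rows")

    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        if len(pred_row) != len(truth_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # predicted traversable, labelled traversable
            elif p and not t:
                fp += 1  # predicted traversable, labelled non-traversable
            elif t:
                fn += 1  # missed a traversable pixel
            else:
                tn += 1  # both agree on non-traversable

    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

Averaging these per-image values over the annotated subset would give the headline numbers the referee asks for. Whether to average per image or pool pixels across images is a design choice that should be stated alongside the results.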

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or self-referential predictions

full rationale

The paper introduces EgoWalk as a 50-hour multimodal navigation dataset with automatic annotation pipelines for goals and traversability masks. No equations, fitted parameters, predictions, or derivation chains are present in the abstract or described content. The contribution is a data collection and release effort rather than a closed mathematical or predictive claim that could reduce to its own inputs by construction. Self-citations are not load-bearing here, and no uniqueness theorems or ansatzes are invoked. This is a standard honest non-finding for a dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that human egocentric navigation data can be directly repurposed for robot imitation learning; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Human navigation trajectories recorded in the wild are suitable training material for data-driven robot navigation policies.
    Implicit in the claim that the dataset supports successful training and robust performance.

pith-pipeline@v0.9.0 · 5720 in / 1082 out tokens · 38190 ms · 2026-05-19T12:53:19.676941+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

    cs.CV · 2025-11 · unverdicted · novelty 7.0

    Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Social robot navigation through constrained optimization: A comprehensive study of uncertainty-based objectives and constraints in the simulated and real world

    Timur Akhtyamov, Aleksandr Kashirin, Aleksey Postnikov, Ivan Sosin, and Gonzalo Ferrer. Social robot navigation through constrained optimization: A comprehensive study of uncertainty-based objectives and constraints in the simulated and real world. Robotics and Autonomous Systems, 183:104830, 2025

  3. [3]

    Unified promptable panoptic mapping with dynamic labeling using foundation models

    Mohamad Al Mdfaa, Raghad Salameh, Geesara Kulathunga, Sergey Zagoruyko, and Gonzalo Ferrer. Unified promptable panoptic mapping with dynamic labeling using foundation models. Robotics, 15(2), 2026

  4. [4]

    Anthropic. Claude. https://www.anthropic.com, 2025. Large language model

  5. [5]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. URL https://arxiv.org/abs/2410.24164, 2024

  7. [7]

    Visual navigation for mobile robots: A survey

    Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. Visual navigation for mobile robots: A survey. Journal of intelligent and robotic systems, 53:263–296, 2008

  8. [8]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  9. [9]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018

  10. [10]

    A Survey of Sim-to-Real Methods in RL: Progress, Prospects and Challenges with Foundation Models

    Longchao Da, Justin Turnau, Thirulogasankar Pranav Kutralingam, Alvaro Velasquez, Paulo Shakarian, and Hua Wei. A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models. arXiv preprint arXiv:2502.13187, 2025

  11. [11]

    Deep writer AI writing assistant

    Deep Writer. Deep writer AI writing assistant. https://deepwriter.com, 2025

  12. [12]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Conference on Robot Learning, 2024

  13. [13]

    Flownav: Combining flow matching and depth priors for efficient navigation

    Samiran Gode, Abhijeet Nayak, and Wolfram Burgard. Flownav: Learning efficient navigation policies via conditional flow matching. arXiv preprint arXiv:2411.09524, 2024

  14. [14]

    Gemini 3 flash model card

    Google DeepMind. Gemini 3 flash model card. Technical report, Google DeepMind, December 2025

  15. [15]

    Vision-and-language navigation: A survey of tasks, methods, and future directions

    Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7606–7623, Dublin, ...

  16. [16]

    Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

    Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024

  17. [17]

    From seeing to experiencing: Scaling navigation foundation models with reinforcement learning

    Honglin He, Yukai Ma, Wayne Wu, and Bolei Zhou. From seeing to experiencing: Scaling navigation foundation models with reinforcement learning. arXiv preprint arXiv:2507.22028, 2025

  18. [18]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  19. [19]

    Lelan: Learning a language-conditioned navigation policy from in-the-wild video

    Noriaki Hirose, Catherine Glossop, Ajay Sridhar, Oier Mees, and Sergey Levine. Lelan: Learning a language-conditioned navigation policy from in-the-wild video. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 666–688. PMLR, 0...

  20. [20]

    Sacson: Scalable autonomous control for social navigation

    Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56, 2023

  21. [21]

    CogVLM2: Visual Language Models for Image and Video Understanding

    Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024

  22. [22]

    Segmentation models pytorch

    Pavel Iakubovskii. Segmentation models pytorch. https://github.com/ qubvel/segmentation_models.pytorch, 2019

  23. [23]

    Rellis-3d dataset: Data, benchmarks and analysis

    Peng Jiang, Philip Osteen, Maggie Wigness, and Srikanth Saripalli. Rellis-3d dataset: Data, benchmarks and analysis. In 2021 IEEE international conference on robotics and automation (ICRA), pages 1110–1116. IEEE, 2021

  24. [24]

    V-strong: Visual self-supervised traversability learning for off-road navigation

    Sanghun Jung, JoonHo Lee, Xiangyun Meng, Byron Boots, and Alexander Lambert. V-strong: Visual self-supervised traversability learning for off-road navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 1766–1773. IEEE, 2024

  25. [25]

    Socially Compliant Navigation Dataset (SCAND), 2022

    Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Soeren Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially Compliant Navigation Dataset (SCAND), 2022

  26. [26]

    Mapanything: Universal feed-forward metric 3d reconstruction

    Nikhil Varma Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed-forward metric 3d reconstructio...

  27. [27]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023

  28. [28]

    Droid: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Srirama, Lawrence Chen, Kirsty Ellis, Peter Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Ma, Patrick Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, and Chelsea Finn. Droid: A large-scale in-the-wild robot manipulatio...

  29. [29]

    Robotics: Science and Systems, R:SS ; Conference date: 15-07-2024 Through 19-07-2024

  30. [30]

    Transformer-based deep imitation learning for dual-arm robot manipulation

    Heecheol Kim, Yoshiyuki Ohmura, and Yasuo Kuniyoshi. Transformer-based deep imitation learning for dual-arm robot manipulation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8965–8972. IEEE, 2021

  31. [31]

    Learning semantic traversability with egocentric video and automated annotation strategy

    Yunho Kim, Jeong Hyun Lee, Choongin Lee, Juhyeok Mun, Donghoon Youm, Jeongsoo Park, and Jemin Hwangbo. Learning semantic traversability with egocentric video and automated annotation strategy. IEEE Robotics and Automation Letters, 2024

  32. [32]

    A unified architecture for instance and semantic segmentation

    Alexander Kirillov, Kaiming He, Ross Girshick, and Piotr Dollár. A unified architecture for instance and semantic segmentation. In Computer Vision and Pattern Recognition Conference. CVPR, 2017

  33. [33]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  34. [34]

    Pathdreamer: A world model for indoor navigation

    Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14738–14748, 2021

  35. [35]

    Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning

    Jonáš Kulhánek, Erik Derner, and Robert Babuška. Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning. IEEE Robotics and Automation Letters, 6(3):4345–4352, 2021

  36. [36]

    When humans aren’t optimal: Robots that collaborate with risk-aware humans

    Minae Kwon, Erdem Biyik, Aditi Talati, Karan Bhasin, Dylan P Losey, and Dorsa Sadigh. When humans aren’t optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE international conference on human-robot interaction, pages 43–52, 2020

  37. [37]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  38. [38]

    Urbanverse: Scaling urban simulation by watching city-tour videos

    Mingxuan Liu, Honglin He, Elisa Ricci, Wayne Wu, and Bolei Zhou. Urbanverse: Scaling urban simulation by watching city-tour videos. In The Fourteenth International Conference on Learning Representations, 2026

  39. [39]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024

  40. [40]

    Citywalker: Learning embodied urban navigation from web-scale videos

    Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, and Chen Feng. Citywalker: Learning embodied urban navigation from web-scale videos. arXiv preprint arXiv:2411.17820, 2024

  41. [41]

    Performance of optical flow techniques for indoor navigation with a mobile robot

    Chris McCarthy and Nick Barnes. Performance of optical flow techniques for indoor navigation with a mobile robot. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, volume 5, pages 5093–5098. IEEE, 2004

  42. [42]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  43. [43]

    Simple open-vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European conference on computer vision, pages 728–755. Springer, 2022

  44. [44]

    Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset

    Duc M Nguyen, Mohammad Nazeri, Amirreza Payandeh, Aniket Datar, and Xuesu Xiao. Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7442–7447. IEEE, 2023

  45. [45]

    Visual navigation of mobile robot using optical flow and visual potential field

    Naoya Ohnishi and Atsushi Imiya. Visual navigation of mobile robot using optical flow and visual potential field. In International Workshop on Robot Vision, pages 412–426. Springer, 2008

  46. [46]

    Visual language navigation: A survey and open challenges

    Sang-Min Park and Young-Gab Kim. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, 56(1):365–427, 2023

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–

  48. [48]

    Do robots outperform humans in human-centered domains?

    Robert Riener, Luca Rabezzana, and Yves Zimmermann. Do robots outperform humans in human-centered domains? Frontiers in Robotics and AI, 10:1223946, 2023

  49. [49]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  50. [50]

    ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints

    Dhruv Shah and Sergey Levine. ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. In Proceedings of Robotics: Science and Systems, 2022

  51. [51]

    Gnm: A general navigation model to drive any robot

    Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023

  52. [52]

    Vint: A foundation model for visual navigation

    Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. Vint: A foundation model for visual navigation. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 711–733. PMLR, 06–09 Nov 2023

  53. [53]

    Nomad: Goal masked diffusion policies for navigation and exploration

    Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024

  54. [54]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

  55. [55]

    Humanmimic: Learning natural locomotion and transitions for humanoid robot via wasserstein adversarial imitation

    Annan Tang, Takuma Hiraoka, Naoki Hiraoka, Fan Shi, Kento Kawaharazuka, Kunio Kojima, Kei Okada, and Masayuki Inaba. Humanmimic: Learning natural locomotion and transitions for humanoid robot via wasserstein adversarial imitation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13107–13114. IEEE, 2024

  56. [56]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 2023

  57. [57]

    Sagar M. Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, Hartwig Adam, and Mikhail Sirotenko. Sanpo: A scene understanding, accessibility and human navigation dataset. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pa...

  58. [58]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  59. [59]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  60. [60]

    Difix3d+: Improving 3d reconstructions with single-step diffusion models

    Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26024–26035, 2025

  61. [61]

    Vision-language navigation: a survey and taxonomy

    Wansen Wu, Tao Chang, Xinmeng Li, Quanjun Yin, and Yue Hu. Vision-language navigation: a survey and taxonomy. Neural Computing and Applications, 36(7):3291–3316, 2024

  62. [62]

    Segformer: Simple and efficient design for semantic segmentation with transformers

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021

  63. [63]

    Deep imitation learning for bimanual robotic manipulation

    Fan Xie, Alexander Chowdhury, M De Paolis Kaluza, Linfeng Zhao, Lawson Wong, and Rose Yu. Deep imitation learning for bimanual robotic manipulation. Advances in neural information processing systems, 33:2327–2337, 2020

  64. [64]

    Vid2sim: Realistic and interactive simulation from video for urban navigation

    Ziyang Xie, Zhizheng Liu, Zhenghao Peng, Wayne Wu, and Bolei Zhou. Vid2sim: Realistic and interactive simulation from video for urban navigation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1581–1591, 2025

  65. [65]

    An accurate and real-time relative pose estimation from triple point-line images by decoupling rotation and translation

    Zewen Xu, Yijia He, Hao Wei, Bo Xu, BinJian Xie, and Yihong Wu. An accurate and real-time relative pose estimation from triple point-line images by decoupling rotation and translation. arXiv preprint arXiv:2403.11639, 2024

  66. [66]

    Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024

  67. [67]

    Autonomous visual navigation for mobile robots: A systematic literature review

    Yuri DV Yasuda, Luiz Eduardo G Martins, and Fabio AM Cappabianco. Autonomous visual navigation for mobile robots: A systematic literature review. ACM Computing Surveys (CSUR), 53(1):1–34, 2020

  68. [68]

    A survey on visual navigation for artificial agents with deep reinforcement learning

    Fanyu Zeng, Chen Wang, and Shuzhi Sam Ge. A survey on visual navigation for artificial agents with deep reinforcement learning. IEEE Access, 8:135426–135442, 2020

  69. [69]

    A survey of visual navigation: From geometry to embodied ai

    Tianyao Zhang, Xiaoguang Hu, Jin Xiao, and Guofeng Zhang. A survey of visual navigation: From geometry to embodied ai. Engineering Applications of Artificial Intelligence, 114:105036, 2022

  70. [70]

    Recognize anything: A strong image tagging model

    Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024

  71. [71]

    Unet++: A nested u-net architecture for medical image segmentation

    Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. In International workshop on deep learning in medical image analysis, pages 3–11. Springer, 2018

  72. [72]

    Vr-robo: A real-to-sim-to-real framework for visual robot navigation and locomotion

    Shaoting Zhu, Linzhan Mou, Derun Li, Baijun Ye, Runhan Huang, and Hang Zhao. Vr-robo: A real-to-sim-to-real framework for visual robot navigation and locomotion. IEEE Robotics and Automation Letters, 10(8):7875–7882, 2025.

TIMUR AKHTYAMOV received the B.Sc. degree from Bauman Moscow State Technical University and the M.Sc. degree from the Skolkovo Institu...

His early research focused on hand prosthesis development and EEG-based control systems for prosthetic devices. He is currently a researcher at the Robotics Center, Moscow, Russia, where he works on the development and application of artificial intelligence methods for advanced robotic navigation and autonomous systems. His research interests include deep...