pith. machine review for the scientific record. sign in

arxiv: 2605.13321 · v1 · submitted 2026-05-13 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:01 UTC · model grok-4.3

classification 💻 cs.RO
keywords human-centric navigationvision-language navigationgeometric forecastingsemantic interpretationvision-language modelssocial navigationdynamic environments
0
0 comments X

The pith

HCSG lets robots navigate dynamic spaces by predicting human movements and understanding their intentions through combined geometry and language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces HCSG as a framework for vision-language navigation that accounts for moving people in indoor settings. It creates a module that forecasts human poses and paths while using a vision-language model to describe what humans are doing and why. These details get added to the robot's map so it can plan routes that follow instructions and respect social space. The approach matters because current methods treat people as simple obstacles, leading to more collisions and failed tasks in real homes or offices. Experiments show clear gains in success and fewer bumps when humans are actively modeled this way.

Core claim

HCSG provides a human-centric framework for VLN by introducing a unified Human Understanding Module that synergizes geometric forecasting of human pose and trajectory with semantic interpretation via a VLM to generate natural language descriptions of human actions and intentions. These representations are fused into the agent's topological map for instruction-conditioned planning, supported by a social distance loss, resulting in improved performance on the HA-VLNCE benchmark.

What carries the argument

The unified Human Understanding Module that combines geometric forecasting of poses and trajectories with VLM-generated semantic descriptions of intentions, fused into a topological map.

Load-bearing premise

The unified module can reliably predict accurate human poses, trajectories, and intention descriptions that improve planning in unseen real-world dynamic scenes.

What would settle it

Running the system in a controlled indoor environment with unpredictable pedestrian movements and measuring whether success rate stays above baseline levels or collision rate remains reduced.

Figures

Figures reproduced from arXiv: 2605.13321 by Haoang Li, Haoxuan Xu, Hesheng Wang, Huashuo Lei, Jin Wu, Lujia Wang, Tianfu Li, Wenbo Chen, Yi Liu, Yunfan Lou.

Figure 1
Figure 1. Figure 1: An illustrative navigation scenario in a human-populated environment. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed Human-Centric Semantic-Geometric Reasoning framework for Vision-Language Navigation (HCSG). Starting from panoramic [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pipeline of the Human Geometric Reasoning Module. We decom [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of human-centric semantic-geometric reasoning. Given the human detected in the image sequence, the agent performs semantic reasoning [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of results on HA-VLNCE benchmark. (a) shows that [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of results on HA-VLNCE benchmark. (a) shows that [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results of real-world deployment on the NXROBO Leo [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
read the original abstract

VLN has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans merely as moving obstacles based on implicit visual cues, lacking the explicit reasoning required to interpret human intentions or maintain social norms. To address this, we propose HCSG, the first human-centric framework for VLN. This framework provides a robust foundation for safe, socially intelligent navigation in dynamic human-robot environments that shifts the paradigm from passive collision avoidance to active human behavior understanding. Specifically, HCSG introduces a unified Human Understanding Module that synergizes two key capabilities: (i) geometric forecasting, which predicts human pose and trajectory to anticipate future motion dynamics; and (ii) semantic interpretation, which leverages a Vision-Language Model (VLM) to generate natural language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning. Furthermore, a social distance loss is introduced to enforce socially compliant interaction distances. Extensive experiments on the HA-VLNCE benchmark demonstrate that HCSG significantly outperforms state-of-the-art methods, achieving a 14% improvement in Success Rate and a 34% reduction in Collision Rate. Our project can be seen at https://haoxuanxu1024.github.io/HCSG/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper proposes HCSG, the first human-centric framework for Vision-Language Navigation (VLN) in dynamic indoor environments. It introduces a unified Human Understanding Module that combines geometric forecasting of human poses and trajectories with semantic interpretation via a Vision-Language Model (VLM) to generate natural-language descriptions of human actions and intentions. These representations are fused into the agent's topological map for instruction-conditioned planning, with an added social distance loss to enforce compliant interaction distances. Experiments on the HA-VLNCE benchmark report a 14% improvement in Success Rate and 34% reduction in Collision Rate over state-of-the-art methods.

Significance. If the central claims hold after detailed verification, this work would be significant for shifting VLN from static-environment assumptions and passive obstacle avoidance toward explicit, active human behavior understanding. It addresses a practical gap in real-world robotics by integrating semantic and geometric cues, potentially improving safety and social compliance in dynamic scenes. The reported benchmark gains on HA-VLNCE suggest tangible advances, though their attribution to the claimed synergy remains to be substantiated.

major comments (4)
  1. [Abstract / Methods] Abstract and Methods: The fusion of semantic-geometric representations into the topological map is described only at a high level ('these semantic-geometric representations are fused'), with no equations, pseudocode, or architectural diagram specifying the integration operation (e.g., concatenation, attention, or learned weighting). This mechanism is load-bearing for the central claim that the synergy enables superior instruction-conditioned planning.
  2. [Results] Results: No ablation studies are reported that isolate the contribution of the fusion step versus the social distance loss or the underlying VLN backbone. Without such controls, the 14% SR and 34% CR gains cannot be confidently attributed to the proposed Human Understanding Module rather than stronger base components.
  3. [Methods] Methods: The paper provides no details on the internal architecture of the unified Human Understanding Module, including how geometric forecasts (pose/trajectory prediction) are combined with VLM-generated descriptions, nor any validation metrics for prediction accuracy in unseen dynamic scenes.
  4. [Experiments] Experiments: The abstract reports benchmark gains but omits baseline details, error bars, statistical significance tests, and the exact interaction between the social distance loss and the fused representations, preventing verification of the soundness of the 14% and 34% improvements.
minor comments (1)
  1. [Abstract] The project website link is provided, but the manuscript does not indicate whether code, models, or the HA-VLNCE benchmark splits will be released to support reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the effectiveness of the newly introduced Human Understanding Module and the social distance loss; these are presented as novel contributions without independent external validation in the abstract.

axioms (1)
  • domain assumption Existing VLN topological maps can be extended with human pose/trajectory forecasts and VLM-generated language descriptions without breaking instruction-conditioned planning.
    Invoked when the paper states that semantic-geometric representations are fused into the agent's topological map.
invented entities (1)
  • Human Understanding Module no independent evidence
    purpose: To unify geometric forecasting and semantic interpretation of human behavior for VLN planning.
    New module introduced by the paper; no independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5579 in / 1329 out tokens · 61626 ms · 2026-05-14T18:01:35.366623+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    BEVBert: Multimodal map pre-training for language-guided navigation,

    D. An, Y . Qi, Y . Li, Y . Huang, L. Wang, T. Tan, and J. Shao, “BEVBert: Multimodal map pre-training for language-guided navigation,”Proceed- ings of the IEEE/CVF International Conference on Computer Vision, 2023

  2. [2]

    ETPNav: Evolving topological planning for vision-language navigation in continuous environments,

    D. An, H. Wang, W. Wang, Z. Wang, Y . Huang, K. He, and L. Wang, “ETPNav: Evolving topological planning for vision-language navigation in continuous environments,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  3. [3]

    Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments,

    P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. S ¨underhauf, I. Reid, S. Gould, and A. Van Den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683

  4. [4]

    Room-Across- Room: Multilingual Vision-and-Language Navigation with Dense Spa- tiotemporal Grounding,

    A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge, “Room-Across- Room: Multilingual Vision-and-Language Navigation with Dense Spa- tiotemporal Grounding,” inProc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Nov. 2020, pp. 4392–4412

  5. [5]

    Beyond the Nav-Graph: Vision-and-language navigation in continuous environ- ments,

    J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the Nav-Graph: Vision-and-language navigation in continuous environ- ments,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 104–120

  6. [6]

    MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation,

    J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y . K. Wong, “MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  7. [7]

    To boost zero- shot generalization for embodied reasoning with vision-language pre- training,

    K. Su, X. Zhang, S. Zhang, J. Zhu, and B. Zhang, “To boost zero- shot generalization for embodied reasoning with vision-language pre- training,”IEEE Transactions on Image Processing, vol. 33, pp. 5370– 5381, 2024

  8. [8]

    NavGPT: Explicit reasoning in vision- and-language navigation with large language models,

    G. Zhou, Y . Hong, and Q. Wu, “NavGPT: Explicit reasoning in vision- and-language navigation with large language models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7641–7649

  9. [9]

    Holistic lstm for pedestrian trajectory prediction,

    R. Quan, L. Zhu, Y . Wu, and Y . Yang, “Holistic lstm for pedestrian trajectory prediction,”IEEE transactions on image processing, vol. 30, pp. 3229–3239, 2021

  10. [10]

    HA-VLN: A benchmark for human-aware navigation in discrete-continuous environments with dynamic multi- human interactions, real-world validation, and an open leaderboard,

    Y . Dong, F. Wu, Q. He, H. Li, M. Li, Z. Cheng, Y . Zhou, J. Sun, Q. Dai, Z.-Q. Chenget al., “HA-VLN: A benchmark for human-aware navigation in discrete-continuous environments with dynamic multi- human interactions, real-world validation, and an open leaderboard,” arXiv preprint arXiv:2503.14229, 2025

  11. [11]

    Habitat 2.0: Training home assistants to rearrange their habitat,

    A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y . Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V . V ondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V . Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” inAdvances in Neural Information Processing Sys...

  12. [12]

    Habitat: A Platform for Embodied AI Research,

    M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A Platform for Embodied AI Research,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  13. [13]

    Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots,

    X. Puig, E. Undersander, A. Szot, M. D. Cote, R. Partsey, J. Yang, R. Desai, A. W. Clegg, M. Hlavac, T. Min, T. Gervet, V . V ondruˇs, V .-P. Berges, J. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi, “Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots,” inInternational Conference ...

  14. [14]

    DRAGON: A dialogue-based robot for assistive navigation with visual language grounding,

    S. Liu, A. Hasan, K. Hong, R. Wang, P. Chang, Z. Mizrachi, J. Lin, D. L. McPherson, W. A. Rogers, and K. Driggs-Campbell, “DRAGON: A dialogue-based robot for assistive navigation with visual language grounding,”IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3712–3719, 2024

  15. [15]

    CoRI: Communication of robot intent for physical human-robot interaction,

    J. Wang, E. B. K ¨uc ¸¨uktabak, R. S. Zarrin, and Z. Erickson, “CoRI: Communication of robot intent for physical human-robot interaction,” in9th Annual Conference on Robot Learning, 2025

  16. [16]

    Social-LLaVa: Enhancing robot navigation through human-language reasoning in social spaces,

    A. Payandeh, D. Song, M. Nazeri, J. Liang, P. Mukherjee, A. H. Raj, Y . Kong, D. Manocha, and X. Xiao, “Social-LLaVa: Enhancing robot navigation through human-language reasoning in social spaces,”arXiv preprint arXiv:2501.09024, 2024

  17. [17]

    VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,

    D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,”IEEE Robotics and Automation Letters, vol. 10, no. 1, pp. 508–515, 2025

  18. [18]

    Bridging the gap be- tween learning in discrete and continuous environments for vision- and-language navigation,

    Y . Hong, Z. Wang, Q. Wu, and S. Gould, “Bridging the gap be- tween learning in discrete and continuous environments for vision- and-language navigation,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 418–15 428

  19. [19]

    Neighbor-view enhanced model for vision and language navigation,

    D. An, Y . Qi, Y . Huang, Q. Wu, L. Wang, and T. Tan, “Neighbor-view enhanced model for vision and language navigation,” inProceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5101–5109. 11

  20. [20]

    Unbiased directed object attention graph for object navigation,

    R. Dang, Z. Shi, L. Wang, Z. He, C. Liu, and Q. Chen, “Unbiased directed object attention graph for object navigation,” inProceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 3617–3627

  21. [21]

    A multilevel atten- tion network with sub-instructions for continuous vision-and-language navigation,

    Z. He, L. Wang, S. Li, Q. Yan, C. Liu, and Q. Chen, “A multilevel atten- tion network with sub-instructions for continuous vision-and-language navigation,”Applied Intelligence, vol. 55, no. 7, 2025

  22. [22]

    VLN⟳BERT: A recurrent vision-and-language bert for navigation,

    Y . Hong, Q. Wu, Y . Qi, C. Rodriguez-Opazo, and S. Gould, “VLN⟳BERT: A recurrent vision-and-language bert for navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1643–1653

  23. [23]

    Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,

    X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y .-F. Wang, W. Y . Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6629–6638

  24. [24]

    P 3Nav: End-to-end perception, prediction and planning for vision-and-language navigation,

    T. Li, W. Chen, H. Xu, X. Zheng, and H. Li, “P 3Nav: End-to-end perception, prediction and planning for vision-and-language navigation,” arXiv preprint arXiv:2603.17459, 2026

  25. [25]

    EnvEdit: Environment editing for vision- and-language navigation,

    J. Li, H. Tan, and M. Bansal, “EnvEdit: Environment editing for vision- and-language navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 407–15 417

  26. [26]

    Scaling data generation in vision-and-language navigation,

    Z. Wang, J. Li, Y . Hong, Y . Wang, Q. Wu, M. Bansal, S. Gould, H. Tan, and Y . Qiao, “Scaling data generation in vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 009–12 020

  27. [27]

    Speaker- follower models for vision-and-language navigation,

    D. Fried, R. Hu, V . Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, “Speaker- follower models for vision-and-language navigation,”Advances in Neu- ral Information Processing Systems, vol. 31, 2018

  28. [28]

    Learning to Navigate Unseen Environ- ments: Back Translation with Environmental Dropout,

    H. Tan, L. Yu, and M. Bansal, “Learning to Navigate Unseen Environ- ments: Back Translation with Environmental Dropout,” inConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 2610–2621

  29. [29]

    Enhancing vision-language navigation with multimodal event knowledge from real- world indoor tour videos,

    H. Xu, T. Li, W. Chen, Y . Liu, X. Zuo, Y . Song, and H. Li, “Enhancing vision-language navigation with multimodal event knowledge from real- world indoor tour videos,” 2026

  30. [30]

    AirBERT: In-domain pretraining for vision-and-language navigation,

    P.-L. Guhur, M. Tapaswi, S. Chen, I. Laptev, and C. Schmid, “AirBERT: In-domain pretraining for vision-and-language navigation,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1614– 1623

  31. [31]

    Towards learning a generic agent for vision-and-language navigation via pre-training,

    W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards learning a generic agent for vision-and-language navigation via pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 137–13 146

  32. [32]

    Transferable representation learning in vision-and-language navigation,

    H. Huang, V . Jain, H. Mehta, A. Ku, G. Magalhaes, J. Baldridge, and E. Ie, “Transferable representation learning in vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7404–7413

  33. [33]

    Improving vision-and-language navigation with image-text pairs from the web,

    A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra, “Improving vision-and-language navigation with image-text pairs from the web,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 259–274

  34. [34]

    HOP+: History- enhanced and order-aware pre-training for vision-and-language naviga- tion,

    Y . Qiao, Y . Qi, Y . Hong, Z. Yu, P. Wang, and Q. Wu, “HOP+: History- enhanced and order-aware pre-training for vision-and-language naviga- tion,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8524–8537, 2023

  35. [35]

    ThinkMat- ter: Panoramic-aware instructional semantics for monocular vision-and- language navigation,

    G. Dai, S. Wang, H. Zhao, B. Zhu, Q. Sun, and X. Shu, “ThinkMat- ter: Panoramic-aware instructional semantics for monocular vision-and- language navigation,”IEEE Transactions on Image Processing, 2026

  36. [36]

    LangLoc: Language-driven localization via formatted spatial description genera- tion,

    W. Shi, C. Chen, K. Li, Y . Xiong, X. Cao, and Z. Zhou, “LangLoc: Language-driven localization via formatted spatial description genera- tion,”IEEE Transactions on Image Processing, 2025

  37. [37]

    Think Global, Act Local: Dual-scale graph transformer for vision-and-language navigation,

    S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Think Global, Act Local: Dual-scale graph transformer for vision-and-language navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 537–16 547

  38. [38]

    GridMM: Grid memory map for vision-and-language navigation,

    Z. Wang, X. Li, J. Yang, Y . Liu, and S. Jiang, “GridMM: Grid memory map for vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 625–15 636

  39. [39]

    V olumetric environment representation for vision-language navigation,

    R. Liu, W. Wang, and Y . Yang, “V olumetric environment representation for vision-language navigation,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024, pp. 16 317– 16 328

  40. [40]

    Com- prehensive attribute prediction learning for person search by language,

    K. Niu, L. Huang, Y . Long, Y . Huang, L. Wang, and Y . Zhang, “Com- prehensive attribute prediction learning for person search by language,” IEEE Transactions on Image Processing, vol. 33, pp. 1990–2003, 2024

  41. [41]

    TrajFine: Predicted trajectory refinement for pedestrian trajectory fore- casting,

    K.-L. Wang, L.-W. Tsao, J.-C. Wu, H.-H. Shuai, and W.-H. Cheng, “TrajFine: Predicted trajectory refinement for pedestrian trajectory fore- casting,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024, pp. 4483–4492

  42. [42]

    PoseScript: 3d human poses from natural language,

    G. Delmas, P. Weinzaepfel, T. Lucas, F. Moreno-Noguer, and G. Rogez, “PoseScript: 3d human poses from natural language,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 346–362

  43. [43]

    MotionLLM: Understanding human behaviors from human motions and videos,

    L.-H. Chen, S. Lu, A. Zeng, H. Zhang, B. Wang, R. Zhang, and L. Zhang, “MotionLLM: Understanding human behaviors from human motions and videos,”IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, pp. 1–15, 2025

  44. [44]

    Human-aware vision-and-language navigation: Bridging simulation to reality with dynamic human interac- tions,

    H. Li, M. Li, Z.-Q. Cheng, Y . Dong, Y . Zhou, J.-Y . He, Q. Dai, T. Mitamura, and A. G. Hauptmann, “Human-aware vision-and-language navigation: Bridging simulation to reality with dynamic human interac- tions,”Advances in Neural Information Processing Systems, vol. 37, pp. 119 411–119 442, 2024

  45. [45]

    Decouple ego-view motions for predicting pedestrian trajectory and intention,

    Z. Zhang, Z. Ding, and R. Tian, “Decouple ego-view motions for predicting pedestrian trajectory and intention,”IEEE Transactions on Image Processing, vol. 33, pp. 4716–4727, 2024

  46. [46]

    Rethinking Social Robot Navigation: Leverag- ing the best of two worlds,

    A. H. Raj, Z. Hu, H. Karnan, R. Chandra, A. Payandeh, L. Mao, P. Stone, J. Biswas, and X. Xiao, “Rethinking Social Robot Navigation: Leverag- ing the best of two worlds,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16 330–16 337

  47. [47]

    Social elastic band with prediction and anticipation: Enhancing real-time path tra- jectory optimization for socially aware robot navigation,

    G. P ´erez, N. Zapata-Cornejo, P. Bustos, and P. N ´u˜nez, “Social elastic band with prediction and anticipation: Enhancing real-time path tra- jectory optimization for socially aware robot navigation,”International Journal of Social Robotics, vol. 17, no. 10, pp. 2041–2063, 2025

  48. [48]

    SICNav: Safe and interactive crowd navigation using model predictive control and bilevel optimization,

    S. Samavi, J. R. Han, F. Shkurti, and A. P. Schoellig, “SICNav: Safe and interactive crowd navigation using model predictive control and bilevel optimization,”IEEE Transactions on Robotics, vol. 41, p. 801–818, 2025

  49. [49]

    SCSV: Spatial-temporal consistent dynamic 3d scene generation from sparse views,

    J. Li, J. He, W. Liu, T. Huang, S. Zhou, J. Ma, H. Wang, and H. Li, “SCSV: Spatial-temporal consistent dynamic 3d scene generation from sparse views,”IEEE Transactions on Image Processing, 2026

  50. [50]

    On legible and pre- dictable robot navigation in multi-agent environments,

    J.-L. Bastarache, C. Nielsen, and S. L. Smith, “On legible and pre- dictable robot navigation in multi-agent environments,” in2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 5508–5514

  51. [51]

    Crowd-aware socially compliant robot navigation via deep reinforcement learning,

    B. Xue, M. Gao, C. Wang, Y . Cheng, and F. Zhou, “Crowd-aware socially compliant robot navigation via deep reinforcement learning,” International Journal of Social Robotics, vol. 16, no. 1, pp. 197–209, 2024

  52. [52]

    Socially aware robot crowd navigation via online uncertainty-driven risk adaptation,

    Z. Sun, X. Diao, Y . Wang, B.-K. Zhu, and J. Wang, “Socially aware robot crowd navigation via online uncertainty-driven risk adaptation,” arXiv preprint arXiv:2506.14305, 2025

  53. [53]

    From cognition to precognition: A future-aware framework for social navigation,

    Z. Gong, T. Hu, R. Qiu, and J. Liang, “From cognition to precognition: A future-aware framework for social navigation,” in2025 IEEE Inter- national Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 9122–9129

  54. [54]

    Hartley and A

    R. Hartley and A. Zisserman,Multiple view geometry in computer vision. Cambridge university press, 2003

  55. [55]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  56. [56]

    Language- aligned waypoint (law) supervision for vision-and-language navigation in continuous environments,

    S. Raychaudhuri, S. Wani, S. Patel, U. Jain, and A. Chang, “Language- aligned waypoint (law) supervision for vision-and-language navigation in continuous environments,” inProceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 4018–4028

  57. [57]

    YOLO-Pose: Enhanc- ing yolo for multi person pose estimation using object keypoint simi- larity loss,

    D. Maji, S. Nagori, M. Mathew, and D. Poddar, “YOLO-Pose: Enhanc- ing yolo for multi person pose estimation using object keypoint simi- larity loss,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 2637–2646

  58. [58]

    Qwen3 Technical Report

    Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388