arxiv: 2605.13321 · v1 · submitted 2026-05-13 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

Haoxuan Xu , Tianfu Li , Wenbo Chen , Yi Liu , Jin Wu , Huashuo Lei , Yunfan Lou , Lujia Wang

show 2 more authors

Hesheng Wang Haoang Li

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:01 UTC · model grok-4.3

classification 💻 cs.RO

keywords human-centric navigationvision-language navigationgeometric forecastingsemantic interpretationvision-language modelssocial navigationdynamic environments

0 comments

The pith

HCSG lets robots navigate dynamic spaces by predicting human movements and understanding their intentions through combined geometry and language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces HCSG as a framework for vision-language navigation that accounts for moving people in indoor settings. It creates a module that forecasts human poses and paths while using a vision-language model to describe what humans are doing and why. These details get added to the robot's map so it can plan routes that follow instructions and respect social space. The approach matters because current methods treat people as simple obstacles, leading to more collisions and failed tasks in real homes or offices. Experiments show clear gains in success and fewer bumps when humans are actively modeled this way.

Core claim

HCSG provides a human-centric framework for VLN by introducing a unified Human Understanding Module that synergizes geometric forecasting of human pose and trajectory with semantic interpretation via a VLM to generate natural language descriptions of human actions and intentions. These representations are fused into the agent's topological map for instruction-conditioned planning, supported by a social distance loss, resulting in improved performance on the HA-VLNCE benchmark.

What carries the argument

The unified Human Understanding Module that combines geometric forecasting of poses and trajectories with VLM-generated semantic descriptions of intentions, fused into a topological map.

Load-bearing premise

The unified module can reliably predict accurate human poses, trajectories, and intention descriptions that improve planning in unseen real-world dynamic scenes.

What would settle it

Running the system in a controlled indoor environment with unpredictable pedestrian movements and measuring whether success rate stays above baseline levels or collision rate remains reduced.

Figures

Figures reproduced from arXiv: 2605.13321 by Haoang Li, Haoxuan Xu, Hesheng Wang, Huashuo Lei, Jin Wu, Lujia Wang, Tianfu Li, Wenbo Chen, Yi Liu, Yunfan Lou.

**Figure 2.** Figure 2: Overview of the proposed Human-Centric Semantic-Geometric Reasoning framework for Vision-Language Navigation (HCSG). Starting from panoramic [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Pipeline of the Human Geometric Reasoning Module. We decom [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of human-centric semantic-geometric reasoning. Given the human detected in the image sequence, the agent performs semantic reasoning [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of results on HA-VLNCE benchmark. (a) shows that [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of results on HA-VLNCE benchmark. (a) shows that [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative results of real-world deployment on the NXROBO Leo [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

read the original abstract

VLN has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans merely as moving obstacles based on implicit visual cues, lacking the explicit reasoning required to interpret human intentions or maintain social norms. To address this, we propose HCSG, the first human-centric framework for VLN. This framework provides a robust foundation for safe, socially intelligent navigation in dynamic human-robot environments that shifts the paradigm from passive collision avoidance to active human behavior understanding. Specifically, HCSG introduces a unified Human Understanding Module that synergizes two key capabilities: (i) geometric forecasting, which predicts human pose and trajectory to anticipate future motion dynamics; and (ii) semantic interpretation, which leverages a Vision-Language Model (VLM) to generate natural language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning. Furthermore, a social distance loss is introduced to enforce socially compliant interaction distances. Extensive experiments on the HA-VLNCE benchmark demonstrate that HCSG significantly outperforms state-of-the-art methods, achieving a 14% improvement in Success Rate and a 34% reduction in Collision Rate. Our project can be seen at https://haoxuanxu1024.github.io/HCSG/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HCSG adds an explicit Human Understanding Module that pairs geometric human pose/trajectory forecasting with VLM-generated intent descriptions, then fuses them into the topological map plus a social distance loss, and reports 14% SR gains and 34% collision drops on HA-VLNCE.

read the letter

The core contribution is a unified module that does two things at once: it forecasts where people will move using geometric models, and it uses a VLM to turn visual observations into natural-language descriptions of what those people are doing or planning. Those two streams get combined and dropped into the agent's topological map so the planner can condition on both the geometry and the semantics while also penalizing socially bad distances. That is a clear step past the usual approach of treating humans as generic moving obstacles detected only by visual cues. The abstract positions this as the first such explicit fusion for VLN, and the benchmark numbers look plausible for a dynamic indoor setting. If the full experiments back it up with proper ablations, it gives a concrete handle on socially compliant navigation that service robots actually need. The main soft spot is that the fusion step itself is described at a high level. It is not obvious from the abstract whether the semantic and geometric features are concatenated, attended, or combined through some learned gate, and there is no indication yet of an ablation that isolates the contribution of the VLM descriptions versus the geometric forecasts alone. The social distance loss could be carrying a lot of the reported improvement. Without seeing the exact baseline implementations, error bars, or how the HA-VLNCE episodes were constructed, it is hard to judge whether the 14% and 34% figures are robust or sensitive to small changes in the setup. This is the kind of paper that belongs in a reading group for people working on real-world VLN or human-robot interaction in indoor spaces. The idea is grounded enough and the claims are falsifiable, so it deserves a serious referee even if the fusion details and ablations will need tightening. I would not cite it yet but would want to see the camera-ready version.

Referee Report

4 major / 1 minor

Summary. The paper proposes HCSG, the first human-centric framework for Vision-Language Navigation (VLN) in dynamic indoor environments. It introduces a unified Human Understanding Module that combines geometric forecasting of human poses and trajectories with semantic interpretation via a Vision-Language Model (VLM) to generate natural-language descriptions of human actions and intentions. These representations are fused into the agent's topological map for instruction-conditioned planning, with an added social distance loss to enforce compliant interaction distances. Experiments on the HA-VLNCE benchmark report a 14% improvement in Success Rate and 34% reduction in Collision Rate over state-of-the-art methods.

Significance. If the central claims hold after detailed verification, this work would be significant for shifting VLN from static-environment assumptions and passive obstacle avoidance toward explicit, active human behavior understanding. It addresses a practical gap in real-world robotics by integrating semantic and geometric cues, potentially improving safety and social compliance in dynamic scenes. The reported benchmark gains on HA-VLNCE suggest tangible advances, though their attribution to the claimed synergy remains to be substantiated.

major comments (4)

[Abstract / Methods] Abstract and Methods: The fusion of semantic-geometric representations into the topological map is described only at a high level ('these semantic-geometric representations are fused'), with no equations, pseudocode, or architectural diagram specifying the integration operation (e.g., concatenation, attention, or learned weighting). This mechanism is load-bearing for the central claim that the synergy enables superior instruction-conditioned planning.
[Results] Results: No ablation studies are reported that isolate the contribution of the fusion step versus the social distance loss or the underlying VLN backbone. Without such controls, the 14% SR and 34% CR gains cannot be confidently attributed to the proposed Human Understanding Module rather than stronger base components.
[Methods] Methods: The paper provides no details on the internal architecture of the unified Human Understanding Module, including how geometric forecasts (pose/trajectory prediction) are combined with VLM-generated descriptions, nor any validation metrics for prediction accuracy in unseen dynamic scenes.
[Experiments] Experiments: The abstract reports benchmark gains but omits baseline details, error bars, statistical significance tests, and the exact interaction between the social distance loss and the fused representations, preventing verification of the soundness of the 14% and 34% improvements.

minor comments (1)

[Abstract] The project website link is provided, but the manuscript does not indicate whether code, models, or the HA-VLNCE benchmark splits will be released to support reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the effectiveness of the newly introduced Human Understanding Module and the social distance loss; these are presented as novel contributions without independent external validation in the abstract.

axioms (1)

domain assumption Existing VLN topological maps can be extended with human pose/trajectory forecasts and VLM-generated language descriptions without breaking instruction-conditioned planning.
Invoked when the paper states that semantic-geometric representations are fused into the agent's topological map.

invented entities (1)

Human Understanding Module no independent evidence
purpose: To unify geometric forecasting and semantic interpretation of human behavior for VLN planning.
New module introduced by the paper; no independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5579 in / 1329 out tokens · 61626 ms · 2026-05-14T18:01:35.366623+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

semantic-geometric representations are fused into the agent's topological map ... Social Distance Loss ... Ltotal = Lpose + Ltraj + Lcoll + Lprox + Lnav (Eq. 13)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 5 canonical work pages · 1 internal anchor

[1]

BEVBert: Multimodal map pre-training for language-guided navigation,

D. An, Y . Qi, Y . Li, Y . Huang, L. Wang, T. Tan, and J. Shao, “BEVBert: Multimodal map pre-training for language-guided navigation,”Proceed- ings of the IEEE/CVF International Conference on Computer Vision, 2023

2023
[2]

ETPNav: Evolving topological planning for vision-language navigation in continuous environments,

D. An, H. Wang, W. Wang, Z. Wang, Y . Huang, K. He, and L. Wang, “ETPNav: Evolving topological planning for vision-language navigation in continuous environments,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[3]

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments,

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. S ¨underhauf, I. Reid, S. Gould, and A. Van Den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683

2018
[4]

Room-Across- Room: Multilingual Vision-and-Language Navigation with Dense Spa- tiotemporal Grounding,

A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge, “Room-Across- Room: Multilingual Vision-and-Language Navigation with Dense Spa- tiotemporal Grounding,” inProc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), Nov. 2020, pp. 4392–4412

2020
[5]

Beyond the Nav-Graph: Vision-and-language navigation in continuous environ- ments,

J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the Nav-Graph: Vision-and-language navigation in continuous environ- ments,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 104–120

2020
[6]

MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation,

J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y . K. Wong, “MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

2024
[7]

To boost zero- shot generalization for embodied reasoning with vision-language pre- training,

K. Su, X. Zhang, S. Zhang, J. Zhu, and B. Zhang, “To boost zero- shot generalization for embodied reasoning with vision-language pre- training,”IEEE Transactions on Image Processing, vol. 33, pp. 5370– 5381, 2024

2024
[8]

NavGPT: Explicit reasoning in vision- and-language navigation with large language models,

G. Zhou, Y . Hong, and Q. Wu, “NavGPT: Explicit reasoning in vision- and-language navigation with large language models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7641–7649

2024
[9]

Holistic lstm for pedestrian trajectory prediction,

R. Quan, L. Zhu, Y . Wu, and Y . Yang, “Holistic lstm for pedestrian trajectory prediction,”IEEE transactions on image processing, vol. 30, pp. 3229–3239, 2021

2021
[10]

HA-VLN: A benchmark for human-aware navigation in discrete-continuous environments with dynamic multi- human interactions, real-world validation, and an open leaderboard,

Y . Dong, F. Wu, Q. He, H. Li, M. Li, Z. Cheng, Y . Zhou, J. Sun, Q. Dai, Z.-Q. Chenget al., “HA-VLN: A benchmark for human-aware navigation in discrete-continuous environments with dynamic multi- human interactions, real-world validation, and an open leaderboard,” arXiv preprint arXiv:2503.14229, 2025

work page arXiv 2025
[11]

Habitat 2.0: Training home assistants to rearrange their habitat,

A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y . Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V . V ondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V . Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” inAdvances in Neural Information Processing Sys...

2021
[12]

Habitat: A Platform for Embodied AI Research,

M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A Platform for Embodied AI Research,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

2019
[13]

Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots,

X. Puig, E. Undersander, A. Szot, M. D. Cote, R. Partsey, J. Yang, R. Desai, A. W. Clegg, M. Hlavac, T. Min, T. Gervet, V . V ondruˇs, V .-P. Berges, J. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi, “Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots,” inInternational Conference ...

2024
[14]

DRAGON: A dialogue-based robot for assistive navigation with visual language grounding,

S. Liu, A. Hasan, K. Hong, R. Wang, P. Chang, Z. Mizrachi, J. Lin, D. L. McPherson, W. A. Rogers, and K. Driggs-Campbell, “DRAGON: A dialogue-based robot for assistive navigation with visual language grounding,”IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3712–3719, 2024

2024
[15]

CoRI: Communication of robot intent for physical human-robot interaction,

J. Wang, E. B. K ¨uc ¸¨uktabak, R. S. Zarrin, and Z. Erickson, “CoRI: Communication of robot intent for physical human-robot interaction,” in9th Annual Conference on Robot Learning, 2025

2025
[16]

Social-LLaVa: Enhancing robot navigation through human-language reasoning in social spaces,

A. Payandeh, D. Song, M. Nazeri, J. Liang, P. Mukherjee, A. H. Raj, Y . Kong, D. Manocha, and X. Xiao, “Social-LLaVa: Enhancing robot navigation through human-language reasoning in social spaces,”arXiv preprint arXiv:2501.09024, 2024

work page arXiv 2024
[17]

VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,”IEEE Robotics and Automation Letters, vol. 10, no. 1, pp. 508–515, 2025

2025
[18]

Bridging the gap be- tween learning in discrete and continuous environments for vision- and-language navigation,

Y . Hong, Z. Wang, Q. Wu, and S. Gould, “Bridging the gap be- tween learning in discrete and continuous environments for vision- and-language navigation,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 418–15 428

2022
[19]

Neighbor-view enhanced model for vision and language navigation,

D. An, Y . Qi, Y . Huang, Q. Wu, L. Wang, and T. Tan, “Neighbor-view enhanced model for vision and language navigation,” inProceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5101–5109. 11

2021
[20]

Unbiased directed object attention graph for object navigation,

R. Dang, Z. Shi, L. Wang, Z. He, C. Liu, and Q. Chen, “Unbiased directed object attention graph for object navigation,” inProceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 3617–3627

2022
[21]

A multilevel atten- tion network with sub-instructions for continuous vision-and-language navigation,

Z. He, L. Wang, S. Li, Q. Yan, C. Liu, and Q. Chen, “A multilevel atten- tion network with sub-instructions for continuous vision-and-language navigation,”Applied Intelligence, vol. 55, no. 7, 2025

2025
[22]

VLN⟳BERT: A recurrent vision-and-language bert for navigation,

Y . Hong, Q. Wu, Y . Qi, C. Rodriguez-Opazo, and S. Gould, “VLN⟳BERT: A recurrent vision-and-language bert for navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1643–1653

2021
[23]

Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,

X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y .-F. Wang, W. Y . Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6629–6638

2019
[24]

P 3Nav: End-to-end perception, prediction and planning for vision-and-language navigation,

T. Li, W. Chen, H. Xu, X. Zheng, and H. Li, “P 3Nav: End-to-end perception, prediction and planning for vision-and-language navigation,” arXiv preprint arXiv:2603.17459, 2026

work page arXiv 2026
[25]

EnvEdit: Environment editing for vision- and-language navigation,

J. Li, H. Tan, and M. Bansal, “EnvEdit: Environment editing for vision- and-language navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 407–15 417

2022
[26]

Scaling data generation in vision-and-language navigation,

Z. Wang, J. Li, Y . Hong, Y . Wang, Q. Wu, M. Bansal, S. Gould, H. Tan, and Y . Qiao, “Scaling data generation in vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 009–12 020

2023
[27]

Speaker- follower models for vision-and-language navigation,

D. Fried, R. Hu, V . Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, “Speaker- follower models for vision-and-language navigation,”Advances in Neu- ral Information Processing Systems, vol. 31, 2018

2018
[28]

Learning to Navigate Unseen Environ- ments: Back Translation with Environmental Dropout,

H. Tan, L. Yu, and M. Bansal, “Learning to Navigate Unseen Environ- ments: Back Translation with Environmental Dropout,” inConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 2610–2621

2019
[29]

Enhancing vision-language navigation with multimodal event knowledge from real- world indoor tour videos,

H. Xu, T. Li, W. Chen, Y . Liu, X. Zuo, Y . Song, and H. Li, “Enhancing vision-language navigation with multimodal event knowledge from real- world indoor tour videos,” 2026

2026
[30]

AirBERT: In-domain pretraining for vision-and-language navigation,

P.-L. Guhur, M. Tapaswi, S. Chen, I. Laptev, and C. Schmid, “AirBERT: In-domain pretraining for vision-and-language navigation,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1614– 1623

2021
[31]

Towards learning a generic agent for vision-and-language navigation via pre-training,

W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards learning a generic agent for vision-and-language navigation via pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 137–13 146

2020
[32]

Transferable representation learning in vision-and-language navigation,

H. Huang, V . Jain, H. Mehta, A. Ku, G. Magalhaes, J. Baldridge, and E. Ie, “Transferable representation learning in vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7404–7413

2019
[33]

Improving vision-and-language navigation with image-text pairs from the web,

A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra, “Improving vision-and-language navigation with image-text pairs from the web,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 259–274

2020
[34]

HOP+: History- enhanced and order-aware pre-training for vision-and-language naviga- tion,

Y . Qiao, Y . Qi, Y . Hong, Z. Yu, P. Wang, and Q. Wu, “HOP+: History- enhanced and order-aware pre-training for vision-and-language naviga- tion,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8524–8537, 2023

2023
[35]

ThinkMat- ter: Panoramic-aware instructional semantics for monocular vision-and- language navigation,

G. Dai, S. Wang, H. Zhao, B. Zhu, Q. Sun, and X. Shu, “ThinkMat- ter: Panoramic-aware instructional semantics for monocular vision-and- language navigation,”IEEE Transactions on Image Processing, 2026

2026
[36]

LangLoc: Language-driven localization via formatted spatial description genera- tion,

W. Shi, C. Chen, K. Li, Y . Xiong, X. Cao, and Z. Zhou, “LangLoc: Language-driven localization via formatted spatial description genera- tion,”IEEE Transactions on Image Processing, 2025

2025
[37]

Think Global, Act Local: Dual-scale graph transformer for vision-and-language navigation,

S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Think Global, Act Local: Dual-scale graph transformer for vision-and-language navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 537–16 547

2022
[38]

GridMM: Grid memory map for vision-and-language navigation,

Z. Wang, X. Li, J. Yang, Y . Liu, and S. Jiang, “GridMM: Grid memory map for vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 625–15 636

2023
[39]

V olumetric environment representation for vision-language navigation,

R. Liu, W. Wang, and Y . Yang, “V olumetric environment representation for vision-language navigation,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024, pp. 16 317– 16 328

2024
[40]

Com- prehensive attribute prediction learning for person search by language,

K. Niu, L. Huang, Y . Long, Y . Huang, L. Wang, and Y . Zhang, “Com- prehensive attribute prediction learning for person search by language,” IEEE Transactions on Image Processing, vol. 33, pp. 1990–2003, 2024

1990
[41]

TrajFine: Predicted trajectory refinement for pedestrian trajectory fore- casting,

K.-L. Wang, L.-W. Tsao, J.-C. Wu, H.-H. Shuai, and W.-H. Cheng, “TrajFine: Predicted trajectory refinement for pedestrian trajectory fore- casting,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024, pp. 4483–4492

2024
[42]

PoseScript: 3d human poses from natural language,

G. Delmas, P. Weinzaepfel, T. Lucas, F. Moreno-Noguer, and G. Rogez, “PoseScript: 3d human poses from natural language,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 346–362

2022
[43]

MotionLLM: Understanding human behaviors from human motions and videos,

L.-H. Chen, S. Lu, A. Zeng, H. Zhang, B. Wang, R. Zhang, and L. Zhang, “MotionLLM: Understanding human behaviors from human motions and videos,”IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, pp. 1–15, 2025

2025
[44]

Human-aware vision-and-language navigation: Bridging simulation to reality with dynamic human interac- tions,

H. Li, M. Li, Z.-Q. Cheng, Y . Dong, Y . Zhou, J.-Y . He, Q. Dai, T. Mitamura, and A. G. Hauptmann, “Human-aware vision-and-language navigation: Bridging simulation to reality with dynamic human interac- tions,”Advances in Neural Information Processing Systems, vol. 37, pp. 119 411–119 442, 2024

2024
[45]

Decouple ego-view motions for predicting pedestrian trajectory and intention,

Z. Zhang, Z. Ding, and R. Tian, “Decouple ego-view motions for predicting pedestrian trajectory and intention,”IEEE Transactions on Image Processing, vol. 33, pp. 4716–4727, 2024

2024
[46]

Rethinking Social Robot Navigation: Leverag- ing the best of two worlds,

A. H. Raj, Z. Hu, H. Karnan, R. Chandra, A. Payandeh, L. Mao, P. Stone, J. Biswas, and X. Xiao, “Rethinking Social Robot Navigation: Leverag- ing the best of two worlds,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16 330–16 337

2024
[47]

Social elastic band with prediction and anticipation: Enhancing real-time path tra- jectory optimization for socially aware robot navigation,

G. P ´erez, N. Zapata-Cornejo, P. Bustos, and P. N ´u˜nez, “Social elastic band with prediction and anticipation: Enhancing real-time path tra- jectory optimization for socially aware robot navigation,”International Journal of Social Robotics, vol. 17, no. 10, pp. 2041–2063, 2025

2041
[48]

SICNav: Safe and interactive crowd navigation using model predictive control and bilevel optimization,

S. Samavi, J. R. Han, F. Shkurti, and A. P. Schoellig, “SICNav: Safe and interactive crowd navigation using model predictive control and bilevel optimization,”IEEE Transactions on Robotics, vol. 41, p. 801–818, 2025

2025
[49]

SCSV: Spatial-temporal consistent dynamic 3d scene generation from sparse views,

J. Li, J. He, W. Liu, T. Huang, S. Zhou, J. Ma, H. Wang, and H. Li, “SCSV: Spatial-temporal consistent dynamic 3d scene generation from sparse views,”IEEE Transactions on Image Processing, 2026

2026
[50]

On legible and pre- dictable robot navigation in multi-agent environments,

J.-L. Bastarache, C. Nielsen, and S. L. Smith, “On legible and pre- dictable robot navigation in multi-agent environments,” in2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 5508–5514

2023
[51]

Crowd-aware socially compliant robot navigation via deep reinforcement learning,

B. Xue, M. Gao, C. Wang, Y . Cheng, and F. Zhou, “Crowd-aware socially compliant robot navigation via deep reinforcement learning,” International Journal of Social Robotics, vol. 16, no. 1, pp. 197–209, 2024

2024
[52]

Socially aware robot crowd navigation via online uncertainty-driven risk adaptation,

Z. Sun, X. Diao, Y . Wang, B.-K. Zhu, and J. Wang, “Socially aware robot crowd navigation via online uncertainty-driven risk adaptation,” arXiv preprint arXiv:2506.14305, 2025

work page arXiv 2025
[53]

From cognition to precognition: A future-aware framework for social navigation,

Z. Gong, T. Hu, R. Qiu, and J. Liang, “From cognition to precognition: A future-aware framework for social navigation,” in2025 IEEE Inter- national Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 9122–9129

2025
[54]

Hartley and A

R. Hartley and A. Zisserman,Multiple view geometry in computer vision. Cambridge university press, 2003

2003
[55]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

2021
[56]

Language- aligned waypoint (law) supervision for vision-and-language navigation in continuous environments,

S. Raychaudhuri, S. Wani, S. Patel, U. Jain, and A. Chang, “Language- aligned waypoint (law) supervision for vision-and-language navigation in continuous environments,” inProceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 4018–4028

2021
[57]

YOLO-Pose: Enhanc- ing yolo for multi person pose estimation using object keypoint simi- larity loss,

D. Maji, S. Nagori, M. Mathew, and D. Poddar, “YOLO-Pose: Enhanc- ing yolo for multi person pose estimation using object keypoint simi- larity loss,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 2637–2646

2022
[58]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025