pith. machine review for the scientific record.

arxiv: 2605.11782 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 05:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · vision language models · assistive navigation · hazard detection · risk mapping · low vision · urban environments · multimodal models

The pith

Vision-language models enable risk-aware urban navigation maps for people with low vision through visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a system that uses visual question answering with vision-language models to describe pedestrian scenes and identify hazards in cities. By applying a three-level query structure and aggregating answers into risk scores, it classifies street segments into four safety categories for navigation planning. The approach is tested on a new dataset covering 20 cities worldwide, revealing that generative multimodal models like Qwen-VL provide better precision and recall than classification methods, offering a flexible way to assist visually impaired people without custom training.
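
To make the query structure concrete, here is a minimal sketch of what a three-level hierarchy could look like. The levels follow the paper's description (coarse scene description, hazard-category presence, fine-grained follow-ups), but every question string and category name below is an invented placeholder, not the paper's actual query set, which is shown only in its Figure 2.

```python
# Illustrative sketch of a three-level hierarchical query structure.
# Question wording and hazard categories are placeholders, not the paper's.
HIERARCHICAL_QUERIES = {
    "level_1": [                     # coarse scene description
        "What kind of pedestrian scene is this?",
    ],
    "level_2": {                     # presence of broad hazard categories
        "surface": "Are there obstacles or uneven surfaces on the walkway?",
        "traffic": "Is there vehicle or bicycle traffic close to the pedestrian path?",
        "crowding": "Is the walkway crowded with people?",
    },
    "level_3": {                     # fine-grained follow-ups, asked only if level-2 fires
        "surface": ["Are there stairs or curbs without tactile paving?",
                    "Is construction blocking part of the sidewalk?"],
        "traffic": ["Is there a marked crossing with signals nearby?"],
        "crowding": ["Are people moving toward the camera?"],
    },
}

def expand_queries(level_2_hits):
    """Return the level-3 questions triggered by positive level-2 answers."""
    return [q for cat in level_2_hits for q in HIERARCHICAL_QUERIES["level_3"][cat]]
```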

Core claim

Generative Multimodal Large Language Models substantially outperform classification-based approaches in VQA for hazard identification, with Qwen-VL achieving the best balance of precision and recall. This supports the creation of navigable risk-aware event maps from aggregated model responses using a hierarchical query structure, demonstrating viability as a foundation for assistive navigation systems.

What carries the argument

VQA-based event map framework with three-level hierarchical queries on VLMs aggregated via weighted risk scoring into four safety categories.
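
A minimal sketch of the aggregation step, assuming binary per-question answers are first reduced to per-category fractions of "yes" responses. The weights, thresholds, and category names below are placeholders; the paper does not state its values here.

```python
# Minimal sketch of weighted risk aggregation into four safety categories.
# Weights, thresholds, and labels are illustrative placeholders only.
CATEGORY_WEIGHTS = {"surface": 0.4, "traffic": 0.4, "crowding": 0.2}   # assumed
THRESHOLDS = [0.2, 0.45, 0.7]   # assumed boundaries between the four categories
LABELS = ["safe", "low risk", "moderate risk", "high risk"]            # assumed names

def risk_score(answers):
    """answers: dict mapping hazard category -> fraction of 'yes' answers (0..1)."""
    return sum(CATEGORY_WEIGHTS[cat] * frac for cat, frac in answers.items())

def safety_category(score):
    """Bucket a weighted risk score into one of four discrete categories."""
    for boundary, label in zip(THRESHOLDS, LABELS):
        if score < boundary:
            return label
    return LABELS[-1]

# Example: a segment where most traffic-related questions came back positive.
segment = {"surface": 0.4, "traffic": 0.9, "crowding": 0.1}
print(safety_category(risk_score(segment)))   # -> "moderate risk" under these placeholders
```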

Load-bearing premise

Responses from off-the-shelf VLMs can be reliably aggregated into accurate four-category safety labels that generalize to new urban settings without additional training or validation.

What would settle it

A field test in cities outside the 20-city dataset in which the generated risk maps frequently misclassify hazards, leading to unsafe navigation suggestions, would undercut the generalization claim.

Figures

Figures reproduced from arXiv: 2605.11782 by Antoni Valls, Jordi Sanchez-Riera.

Figure 1. Overview of the proposed framework. Time-sequential keyframes …
Figure 2. Example of the hierarchical multicategory query structure applied to a street scene in Buenos Aires. The hierarchy begins with Level-1 …
Figure 3. Geographic distribution of the dataset across 20 cities spanning six …
Figure 4. Risk event maps generated by Qwen-VL for routes in Barcelona …
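
Figure 4's route maps suggest one way the four categories could feed a planner. The sketch below, on assumed data, converts each segment's safety category into an edge penalty and asks networkx for the least-risk path; the paper does not name its planner, so this is only one plausible realization.

```python
# Sketch of risk-aware route planning over an event map, assuming each
# street segment already carries one of the four safety categories.
# networkx is used for illustration; the paper does not specify a planner.
import networkx as nx

CATEGORY_PENALTY = {"safe": 1.0, "low risk": 2.0, "moderate risk": 5.0, "high risk": 20.0}  # assumed

def build_graph(segments):
    """segments: iterable of (node_a, node_b, length_m, category)."""
    g = nx.Graph()
    for a, b, length, category in segments:
        g.add_edge(a, b, weight=length * CATEGORY_PENALTY[category])
    return g

segments = [
    ("plaza", "corner_1", 120, "safe"),
    ("corner_1", "station", 80, "high risk"),   # e.g. construction on the sidewalk
    ("plaza", "corner_2", 150, "low risk"),
    ("corner_2", "station", 90, "safe"),
]
g = build_graph(segments)
print(nx.shortest_path(g, "plaza", "station", weight="weight"))
# -> ['plaza', 'corner_2', 'station'] : the longer but lower-risk route wins
```
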
read the original abstract

Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures -ViLT, LLaVA, InstructBLIP, and Qwen-VL- and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
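
As a concrete illustration of "without task-specific retraining", a single hazard question can be posed to a public VQA checkpoint. The sketch below uses Hugging Face's visual-question-answering pipeline with the dandelin/vilt-b32-finetuned-vqa checkpoint as a stand-in; the paper does not disclose its exact checkpoints, prompts, or image sources, so all specifics here are assumptions.

```python
# Minimal sketch: posing one hazard question to an off-the-shelf VQA model.
# The checkpoint, image path, and question are illustrative stand-ins.
from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("street_scene.jpg")            # a pedestrian-view photo (placeholder path)
question = "Is the sidewalk blocked by an obstacle?"

result = vqa(image=image, question=question, top_k=1)[0]
print(result["answer"], result["score"])          # e.g. "yes" plus a confidence score
```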

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an event map framework for risk-aware navigation assistance for individuals with low vision. It utilizes a three-level hierarchical VQA pipeline with off-the-shelf VLMs to analyze urban scenes and identify hazards. A weighted risk scoring system aggregates model responses to classify locations into four safety categories. The authors introduce a new dataset of 800+ images from 20 cities with 18k VQA annotations and benchmark ViLT, LLaVA, InstructBLIP, and Qwen-VL, concluding that generative MLLMs, particularly Qwen-VL, offer superior precision-recall performance for this application.

Significance. If the end-to-end system is shown to produce reliable safety maps, the work could advance flexible assistive technologies by demonstrating the use of general-purpose VLMs without task-specific fine-tuning. The geographically diverse dataset spanning six continents is a notable contribution that could facilitate further research in computer vision for accessibility. The benchmarking results highlight the advantages of generative models over classification-based ones in this context.

major comments (3)
  1. [Evaluation] The reported precision and recall metrics are limited to individual VQA questions (18k answers); no quantitative evaluation is provided for the accuracy of the downstream weighted risk scoring system in producing the four safety categories, such as comparison to expert-labeled ground truth or agreement metrics on the final maps. This is load-bearing for the central claim that the framework produces reliable navigable risk maps.
  2. [Method (risk aggregation)] The weighting scheme for converting VLM responses into risk scores, including the specific weights, category thresholds, and handling of conflicting or hallucinated answers, lacks any reported calibration, cross-validation, or ablation against real-world risk data.
  3. [Dataset] No inter-rater agreement metrics (e.g., Cohen's kappa) are reported for the annotations of the 800 images and 18k questions, which is necessary to establish the reliability of the ground truth used for all benchmarking claims.
minor comments (2)
  1. [Abstract] The abstract states that Qwen-VL achieves the 'best overall balance of precision and recall' but supplies no numerical values; these should be included for transparency.
  2. [Figures] Event map visualizations would benefit from explicit legends and scale bars to clarify the four safety category color codings and geographic coverage.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional evaluation and documentation would strengthen the claims regarding the reliability of the event maps. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation] The reported precision and recall metrics are limited to individual VQA questions (18k answers); no quantitative evaluation is provided for the accuracy of the downstream weighted risk scoring system in producing the four safety categories, such as comparison to expert-labeled ground truth or agreement metrics on the final maps. This is load-bearing for the central claim that the framework produces reliable navigable risk maps.

    Authors: We agree that direct evaluation of the aggregated risk maps is important to support the central claim. The current experiments focus on the VQA stage because it is the core technical contribution and the source of all variability. In the revision we will add a new evaluation subsection that obtains expert safety-category labels (four categories) for a random subset of 150 images spanning multiple cities. We will report precision/recall and Cohen's kappa between the system's weighted risk output and these expert labels, plus qualitative examples of the resulting maps. This provides the missing quantitative link without requiring a new dataset. revision: yes

  2. Referee: [Method (risk aggregation)] The weighting scheme for converting VLM responses into risk scores, including the specific weights, category thresholds, and handling of conflicting or hallucinated answers, lacks any reported calibration, cross-validation, or ablation against real-world risk data.

    Authors: The weights were derived from accessibility literature and consultation with two low-vision navigation specialists; we will add the exact numerical weights, the four category thresholds, and the rule for handling conflicting answers (majority vote with tie-breaking by highest-risk category) to the revised Methods section. We will also include a sensitivity ablation that varies the weights by ±20 % and shows the resulting change in category distribution across the 20-city dataset. A full calibration against real-world incident data is not feasible with currently available public sources, so we will explicitly note this as a limitation and future work. revision: partial

  3. Referee: [Dataset] No inter-rater agreement metrics (e.g., Cohen's kappa) are reported for the annotations of the 800 images and 18k questions, which is necessary to establish the reliability of the ground truth used for all benchmarking claims.

    Authors: We will compute and report inter-rater agreement in the revised dataset section. The 800 images were labeled for safety category by three independent annotators following a written guideline; we will report Fleiss' kappa on the four-category labels. For the 18k VQA answers, a 10% random sample was double-annotated and we will report Cohen's kappa on that sample. These statistics will be added to Table 1 and the text (a sketch of both computations follows this list). revision: yes
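
The agreement statistics promised in the responses above reduce to standard library calls. The sketch below runs on synthetic placeholder labels, using statsmodels for Fleiss' kappa and sklearn for Cohen's kappa; the real annotations are of course not available here.

```python
# Sketch of the agreement statistics promised in the rebuttal, on synthetic labels.
# The arrays below are random placeholders, not the paper's annotations.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Fleiss' kappa: 800 images x 3 annotators, four safety categories (0..3).
ratings = rng.integers(0, 4, size=(800, 3))           # placeholder annotations
table, _ = aggregate_raters(ratings)                   # subjects x categories counts
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Cohen's kappa: the 10% double-annotated sample of the 18k VQA answers.
annotator_a = rng.integers(0, 2, size=1800)            # placeholder yes/no answers
annotator_b = rng.integers(0, 2, size=1800)
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```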

Circularity Check

0 steps flagged

No circularity: empirical benchmarks on new dataset with off-the-shelf models

full rationale

The paper introduces a hierarchical VQA pipeline and weighted aggregation for risk mapping, then evaluates four standard VLMs on a newly collected 800-image/18k-question dataset using precision and recall. No equations, fitted parameters, or predictions are defined in terms of the target outputs; the aggregation step is presented as a fixed heuristic without reported calibration on the evaluation data. Benchmarks compare public models directly against author annotations, with no self-citation load-bearing on uniqueness or ansatz. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the assumption that current VLMs already contain sufficient scene-understanding knowledge and that a simple weighted sum of their answers can be mapped to meaningful safety categories; no new physical entities are postulated.

free parameters (1)
  • risk weights
    Weights used to aggregate model answers into the final risk score for each street segment; values are not stated in the abstract and must be chosen or fitted.
axioms (1)
  • domain assumption: Off-the-shelf VLMs can produce accurate pedestrian-scene descriptions and hazard identifications across varied urban environments
    Invoked when the authors apply ViLT, LLaVA, InstructBLIP and Qwen-VL without retraining.
invented entities (1)
  • event maps (no independent evidence)
    purpose: Discrete safety-category overlays on street segments for route planning
    New representational construct introduced to turn VQA outputs into navigable risk data.

pith-pipeline@v0.9.0 · 5531 in / 1473 out tokens · 30583 ms · 2026-05-13T05:53:38.800203+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    World blindness and visual impairment: despite many successes, the problem is growing,

    P. Ackland, S. Resnikoff, and R. Bourne, “World blindness and visual impairment: despite many successes, the problem is growing,” Community Eye Health, vol. 30, no. 100, pp. 71–73, 2018

  2. [2]

    Motion planning for autonomous driving: The state of the art and future perspectives,

    S. Teng, X. Hu, P. Deng, B. Li, Y. Li, Y. Ai, D. Yang, L. Li, Z. Xuanyuan, F. Zhu, and L. Chen, “Motion planning for autonomous driving: The state of the art and future perspectives,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 6, pp. 3692–3711, 2023

  3. [3]

    Deep leaning-based ultra-fast stair detection,

    C. Wang, Z. Pei, S. Qiu, and Z. Tang, “Deep leaning-based ultra-fast stair detection,” Scientific Reports, vol. 12, no. 1, p. 16124, 2022

  4. [4]

    Is it safe to cross? interpretable risk assessment with gpt-4v for safety-aware street crossing,

    H. Hwang, S. Kwon, Y. Kim, and D. Kim, “Is it safe to cross? interpretable risk assessment with gpt-4v for safety-aware street crossing,” 2024 21st International Conference on Ubiquitous Robots (UR), pp. 281–288, 2024

  5. [5]

    Visual language integration: A survey and open challenges,

    S.-M. Park and Y.-G. Kim, “Visual language integration: A survey and open challenges,” Computer Science Review, vol. 48, p. 100548, 2023

  6. [6]

    Vqa: Visual question answering,

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015, pp. 2425–2433

  7. [7]

    Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation,

    A. Werby, C. Huang, M. Büchner, A. Valada, and W. Burgard, “Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation,” in Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024

  8. [8]

    Taskography: Evaluating robot task planning over large 3d scene graphs,

    C. Agia, K. M. Jatavallabhula, M. Khodeir, O. Miksik, V. Vineet, M. Mukadam, L. Paull, and F. Shkurti, “Taskography: Evaluating robot task planning over large 3d scene graphs,” in Proceedings of the 5th Conference on Robot Learning, vol. 164, 2022, pp. 46–58

  9. [9]

    Vilt: Vision-and-language transformer without convolution or region supervision,

    W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in ICML, vol. 139, 2021, pp. 5583–5594

  10. [10]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, vol. 36, 2023, pp. 34892–34916

  11. [11]

    Instructblip: towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: towards general-purpose vision-language models with instruction tuning,” in NeurIPS, 2023

  12. [12]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

  13. [13]

    Mapless urban robot navigation by following pedestrians,

    S. Buckeridge, P. Carreno-Medrano, A. Cosgun, E. Croft, and W. P. Chan, “Mapless urban robot navigation by following pedestrians,” in IROS, 2023, pp. 6787–6792

  14. [14]

    Dynamic channel: A planning framework for crowd navigation,

    C. Cao, P. Trautman, and S. Iba, “Dynamic channel: A planning framework for crowd navigation,” in ICRA, 2019, pp. 5551–5557

  15. [15]

    Computer vision and deep learning techniques for pedestrian detection and tracking: A survey,

    A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, “Computer vision and deep learning techniques for pedestrian detection and tracking: A survey,” Neurocomputing, vol. 300, pp. 17–33, 2018

  16. [16]

    Group surfing: A pedestrian-based approach to sidewalk robot navigation,

    Y. Du, N. J. Hetherington, C. L. Oon, W. P. Chan, C. P. Quintero, E. Croft, and H. Machiel Van der Loos, “Group surfing: A pedestrian-based approach to sidewalk robot navigation,” in ICRA, 2019, pp. 6518–6524

  17. [17]

    Analysis of the recent ai for pedestrian navigation with wearable inertial sensors,

    H. Fu, V. Renaudin, Y. Kone, and N. Zhu, “Analysis of the recent ai for pedestrian navigation with wearable inertial sensors,” IEEE Journal of Indoor and Seamless Positioning and Navigation, vol. 1, pp. 26–38, 2023

  18. [18]

    Seamless outdoor-indoor pedestrian positioning system with gnss/uwb/imu fusion: A comparison of ekf, fgo, and pf,

    J. Zhang, X. Yu, S. Ha, P. T. Morón, S. Salimpour, F. Keramat, H. Zhang, and T. Westerlund, “Seamless outdoor-indoor pedestrian positioning system with gnss/uwb/imu fusion: A comparison of ekf, fgo, and pf,” ArXiv, vol. abs/2512.10480, 2025

  19. [19]

    From research to app: Personalized inertial navigation for the visually impaired,

    T. Moisan, H. Fu, V. Renaudin, and M. I. Sayyaf, “From research to app: Personalized inertial navigation for the visually impaired,” in Proceedings of the 2025 International Conference on Indoor Positioning and Indoor Navigation (IPIN). Tampere, Finland: Tampere University, 2025

  20. [20]

    Improving pedestrian navigation in urban environment using augmented reality and landmark recognition,

    D. Kumar, S. Iyer, E. Raja, R. Kumar, and V. P. Kafle, “Improving pedestrian navigation in urban environment using augmented reality and landmark recognition,” IEEE Communications Standards Magazine, vol. 8, no. 1, pp. 20–26, 2024

  21. [21]

    Landmark-based pedestrian navigation from collections of geotagged photos,

    H. Hile, R. Vedantham, G. Cuellar, A. Liu, N. Gelfand, R. Grzeszczuk, and G. Borriello, “Landmark-based pedestrian navigation from collections of geotagged photos,” in Proceedings of the 7th International Conference on Mobile and Ubiquitous Multimedia, Dec. 2008, pp. 145–152

  22. [22]

    Personalized landmark adaptive visualization method for pedestrian navigation maps: Considering user familiarity,

    L. Zhu, J. Shen, J. Zhou, Z. Stachoň, S. Hong, and X. Wang, “Personalized landmark adaptive visualization method for pedestrian navigation maps: Considering user familiarity,” Transactions in GIS, vol. 26, no. 2, pp. 669–690, 2022

  23. [23]

    A Personalised Pedestrian Navigation System,

    U. Shah and J. Wang, “A Personalised Pedestrian Navigation System,” in 12th International Conference on Geographic Information Science, ser. Leibniz International Proceedings in Informatics (LIPIcs), vol. 277, 2023, pp. 67:1–67:6

  24. [24]

    What about people in pedestrian navigation?

    Z. Fang, Q. Li, and S.-L. Shaw, “What about people in pedestrian navigation?” Geo-spatial Information Science, vol. 18, no. 4, pp. 135–150, 2015

  25. [25]

    A system for generating customized pleasant pedestrian routes based on openstreetmap data,

    T. Novack, Z. Wang, and A. Zipf, “A system for generating customized pleasant pedestrian routes based on openstreetmap data,” Sensors, vol. 18, p. 3794, 2018

  26. [26]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139, 2021, pp. 8748–8763

  27. [27]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, vol. 162, 2022, pp. 12888–12900

  28. [28]

    Open-vocabulary object detection upon frozen vision and language models,

    W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova, “Open-vocabulary object detection upon frozen vision and language models,” in ICLR, 2023

  29. [29]

    Lami-detr: Open-vocabulary detection with language model instruction,

    P. Du, Y. Wang, Y. Sun, L. Wang, Y. Liao, G. Zhang, E. Ding, Y. Wang, J. Wang, and S. Liu, “Lami-detr: Open-vocabulary detection with language model instruction,” in ECCV, 2024

  30. [30]

    A survey on open-vocabulary detection and segmentation: Past, present, and future,

    C. Zhu and L. Chen, “A survey on open-vocabulary detection and segmentation: Past, present, and future,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8954–8975, 2024

  31. [31]

    Clip-vg: Self-paced curriculum adapting of clip for visual grounding,

    L. Xiao, X. Yang, F. Peng, M. Yan, Y. Wang, and C. Xu, “Clip-vg: Self-paced curriculum adapting of clip for visual grounding,” IEEE Transactions on Multimedia, vol. 26, pp. 4334–4347, 2024

  32. [32]

    Unleashing text-to-image diffusion models for visual perception,

    W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in ICCV, October 2023, pp. 5729–5739

  33. [33]

    Generating contextually-relevant navigation instructions for blind and low vision people,

    Z. Merchant, A. Anwar, E. H. Wang, S. Chattopadhyay, and J. Thomason, “Generating contextually-relevant navigation instructions for blind and low vision people,” in The 1st InterAI Workshop: Interactive AI for Human-centered Robotics, 2024

  34. [34]

    Vialm: A survey and benchmark of visually impaired assistance with large models,

    Y. Zhao, Y. Zhang, R. Xiang, J. Li, and H. Li, “Vialm: A survey and benchmark of visually impaired assistance with large models,” ArXiv, vol. abs/2402.01735, 2024

  35. [35]

    Be My AI,

    Be My Eyes, “Be My AI,” https://www.bemyeyes.com/

  36. [36]

    Walkvlm: Aid visually impaired people walking by vision language model,

    Z. Yuan, T. Zhang, Y. Zhu, J. Zhang, Y. Deng, Z. Jia, P. Luo, X. Duan, J. Zhou, and J. Zhang, “Walkvlm: Aid visually impaired people walking by vision language model,” in ICCV, October 2025, pp. 9845–9854

  37. [37]

    Vqa-driven event maps for assistive navigation for people with low vision in urban environments,

    J. Morales, B. Gebregziabher, A. Cabañeros, and J. Sanchez-Riera, “Vqa-driven event maps for assistive navigation for people with low vision in urban environments,” in ICRA, 2025, pp. 12458–12464

  38. [38]

    Vizwiz grand challenge: Answering visual questions from blind people,

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand challenge: Answering visual questions from blind people,” in CVPR, 2018

  39. [39]

    Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people,

    D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, and J. P. Bigham, “Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people,” in CVPR, 2019, pp. 939–948

  40. [40]

    Guidedog: A real-world egocentric multimodal dataset for blind and low-vision accessibility-aware guidance,

    J. Kim, J. Park, J. Park, S. Lee, J. Chung, J. Kim, J. H. Joung, and Y. Yu, “Guidedog: A real-world egocentric multimodal dataset for blind and low-vision accessibility-aware guidance,” 2025

  41. [41]

    3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans,

    A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” in Robotics: Science and Systems (RSS), 2020

  42. [42]

    Optimal scene graph planning with large language model guidance,

    Z. Dai, A. Asgharivaskasi, T. Duong, S. Lin, M.-E. Tzes, G. Pappas, and N. Atanasov, “Optimal scene graph planning with large language model guidance,” in ICRA, 2024, pp. 14062–14069

  43. [43]

    Bird’s-eye-view scene graph for vision-language navigation,

    R. Liu, X. Wang, W. Wang, and Y. Yang, “Bird’s-eye-view scene graph for vision-language navigation,” in ICCV, October 2023, pp. 10968–10980

  44. [44]

    Long-term object search using incremental scene graph updating,

    F. Zhou, H. Liu, H. Zhao, and L. Liang, “Long-term object search using incremental scene graph updating,” Robotica, vol. 41, no. 3, pp. 962–975, 2023