pith. machine review for the scientific record.

arxiv: 2605.11782 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 05:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · vision language models · assistive navigation · hazard detection · risk mapping · low vision · urban environments · multimodal models

The pith

Vision-language models enable risk-aware urban navigation maps for people with low vision through visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a system that uses visual question answering with vision-language models to describe pedestrian scenes and identify hazards in cities. By applying a three-level query structure and aggregating answers into risk scores, it classifies street segments into four safety categories for navigation planning. The approach is tested on a new dataset covering 20 cities worldwide, revealing that generative multimodal models like Qwen-VL provide better precision and recall than classification methods, offering a flexible way to assist visually impaired people without custom training.
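
To make the query structure concrete, here is a minimal sketch of what a three-level hierarchy could look like. The levels follow the paper's description (coarse scene description, hazard-category presence, fine-grained follow-ups), but every question string and category name below is an invented placeholder, not the paper's actual query set, which is shown only in its Figure 2.

```python
# Illustrative sketch of a three-level hierarchical query structure.
# Question wording and hazard categories are placeholders, not the paper's.
HIERARCHICAL_QUERIES = {
    "level_1": [                     # coarse scene description
        "What kind of pedestrian scene is this?",
    ],
    "level_2": {                     # presence of broad hazard categories
        "surface": "Are there obstacles or uneven surfaces on the walkway?",
        "traffic": "Is there vehicle or bicycle traffic close to the pedestrian path?",
        "crowding": "Is the walkway crowded with people?",
    },
    "level_3": {                     # fine-grained follow-ups, asked only if level-2 fires
        "surface": ["Are there stairs or curbs without tactile paving?",
                    "Is construction blocking part of the sidewalk?"],
        "traffic": ["Is there a marked crossing with signals nearby?"],
        "crowding": ["Are people moving toward the camera?"],
    },
}

def expand_queries(level_2_hits):
    """Return the level-3 questions triggered by positive level-2 answers."""
    return [q for cat in level_2_hits for q in HIERARCHICAL_QUERIES["level_3"][cat]]
```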

Core claim

Generative Multimodal Large Language Models substantially outperform classification-based approaches in VQA for hazard identification, with Qwen-VL achieving the best balance of precision and recall. This supports the creation of navigable risk-aware event maps from aggregated model responses using a hierarchical query structure, demonstrating viability as a foundation for assistive navigation systems.

What carries the argument

VQA-based event map framework with three-level hierarchical queries on VLMs aggregated via weighted risk scoring into four safety categories.
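
A minimal sketch of the aggregation step, assuming binary per-question answers are first reduced to per-category fractions of "yes" responses. The weights, thresholds, and category names below are placeholders; the paper does not state its values here.

```python
# Minimal sketch of weighted risk aggregation into four safety categories.
# Weights, thresholds, and labels are illustrative placeholders only.
CATEGORY_WEIGHTS = {"surface": 0.4, "traffic": 0.4, "crowding": 0.2}   # assumed
THRESHOLDS = [0.2, 0.45, 0.7]   # assumed boundaries between the four categories
LABELS = ["safe", "low risk", "moderate risk", "high risk"]            # assumed names

def risk_score(answers):
    """answers: dict mapping hazard category -> fraction of 'yes' answers (0..1)."""
    return sum(CATEGORY_WEIGHTS[cat] * frac for cat, frac in answers.items())

def safety_category(score):
    """Bucket a weighted risk score into one of four discrete categories."""
    for boundary, label in zip(THRESHOLDS, LABELS):
        if score < boundary:
            return label
    return LABELS[-1]

# Example: a segment where most traffic-related questions came back positive.
segment = {"surface": 0.4, "traffic": 0.9, "crowding": 0.1}
print(safety_category(risk_score(segment)))   # -> "moderate risk" under these placeholders
```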

Load-bearing premise

Responses from off-the-shelf VLMs can be reliably aggregated into accurate four-category safety labels that generalize to new urban settings without additional training or validation.

What would settle it

A field test in cities outside the 20-city dataset in which the generated risk maps frequently misclassify hazards, leading to unsafe navigation suggestions, would undercut the generalization claim.

Figures

Figures reproduced from arXiv: 2605.11782 by Antoni Valls, Jordi Sanchez-Riera.

Figure 1. Overview of the proposed framework. Time-sequential keyframes …
Figure 2. Example of the hierarchical multicategory query structure applied to a street scene in Buenos Aires. The hierarchy begins with Level-1 …
Figure 3. Geographic distribution of the dataset across 20 cities spanning six …
Figure 4. Risk event maps generated by Qwen-VL for routes in Barcelona …
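
Figure 4's route maps suggest one way the four categories could feed a planner. The sketch below, on assumed data, converts each segment's safety category into an edge penalty and asks networkx for the least-risk path; the paper does not name its planner, so this is only one plausible realization.

```python
# Sketch of risk-aware route planning over an event map, assuming each
# street segment already carries one of the four safety categories.
# networkx is used for illustration; the paper does not specify a planner.
import networkx as nx

CATEGORY_PENALTY = {"safe": 1.0, "low risk": 2.0, "moderate risk": 5.0, "high risk": 20.0}  # assumed

def build_graph(segments):
    """segments: iterable of (node_a, node_b, length_m, category)."""
    g = nx.Graph()
    for a, b, length, category in segments:
        g.add_edge(a, b, weight=length * CATEGORY_PENALTY[category])
    return g

segments = [
    ("plaza", "corner_1", 120, "safe"),
    ("corner_1", "station", 80, "high risk"),   # e.g. construction on the sidewalk
    ("plaza", "corner_2", 150, "low risk"),
    ("corner_2", "station", 90, "safe"),
]
g = build_graph(segments)
print(nx.shortest_path(g, "plaza", "station", weight="weight"))
# -> ['plaza', 'corner_2', 'station'] : the longer but lower-risk route wins
```
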
read the original abstract

Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures -ViLT, LLaVA, InstructBLIP, and Qwen-VL- and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
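
As a concrete illustration of "without task-specific retraining", a single hazard question can be posed to a public VQA checkpoint. The sketch below uses Hugging Face's visual-question-answering pipeline with the dandelin/vilt-b32-finetuned-vqa checkpoint as a stand-in; the paper does not disclose its exact checkpoints, prompts, or image sources, so all specifics here are assumptions.

```python
# Minimal sketch: posing one hazard question to an off-the-shelf VQA model.
# The checkpoint, image path, and question are illustrative stand-ins.
from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("street_scene.jpg")            # a pedestrian-view photo (placeholder path)
question = "Is the sidewalk blocked by an obstacle?"

result = vqa(image=image, question=question, top_k=1)[0]
print(result["answer"], result["score"])          # e.g. "yes" plus a confidence score
```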

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an event map framework for risk-aware navigation assistance for individuals with low vision. It utilizes a three-level hierarchical VQA pipeline with off-the-shelf VLMs to analyze urban scenes and identify hazards. A weighted risk scoring system aggregates model responses to classify locations into four safety categories. The authors introduce a new dataset of 800+ images from 20 cities with 18k VQA annotations and benchmark ViLT, LLaVA, InstructBLIP, and Qwen-VL, concluding that generative MLLMs, particularly Qwen-VL, offer superior precision-recall performance for this application.

Significance. If the end-to-end system is shown to produce reliable safety maps, the work could advance flexible assistive technologies by demonstrating the use of general-purpose VLMs without task-specific fine-tuning. The geographically diverse dataset spanning six continents is a notable contribution that could facilitate further research in computer vision for accessibility. The benchmarking results highlight the advantages of generative models over classification-based ones in this context.

major comments (3)
  1. [Evaluation] The reported precision and recall metrics are limited to individual VQA questions (18k answers); no quantitative evaluation is provided for the accuracy of the downstream weighted risk scoring system in producing the four safety categories, such as comparison to expert-labeled ground truth or agreement metrics on the final maps. This is load-bearing for the central claim that the framework produces reliable navigable risk maps.
  2. [Method (risk aggregation)] The weighting scheme for converting VLM responses into risk scores, including the specific weights, category thresholds, and handling of conflicting or hallucinated answers, lacks any reported calibration, cross-validation, or ablation against real-world risk data.
  3. [Dataset] No inter-rater agreement metrics (e.g., Cohen's kappa) are reported for the annotations of the 800 images and 18k questions, which is necessary to establish the reliability of the ground truth used for all benchmarking claims.
minor comments (2)
  1. [Abstract] The abstract states that Qwen-VL achieves the 'best overall balance of precision and recall' but supplies no numerical values; these should be included for transparency.
  2. [Figures] Event map visualizations would benefit from explicit legends and scale bars to clarify the four safety category color codings and geographic coverage.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional evaluation and documentation would strengthen the claims regarding the reliability of the event maps. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation] The reported precision and recall metrics are limited to individual VQA questions (18k answers); no quantitative evaluation is provided for the accuracy of the downstream weighted risk scoring system in producing the four safety categories, such as comparison to expert-labeled ground truth or agreement metrics on the final maps. This is load-bearing for the central claim that the framework produces reliable navigable risk maps.

    Authors: We agree that direct evaluation of the aggregated risk maps is important to support the central claim. The current experiments focus on the VQA stage because it is the core technical contribution and the source of all variability. In the revision we will add a new evaluation subsection that obtains expert safety-category labels (four categories) for a random subset of 150 images spanning multiple cities. We will report precision/recall and Cohen's kappa between the system's weighted risk output and these expert labels, plus qualitative examples of the resulting maps. This provides the missing quantitative link without requiring a new dataset. revision: yes

  2. Referee: [Method (risk aggregation)] The weighting scheme for converting VLM responses into risk scores, including the specific weights, category thresholds, and handling of conflicting or hallucinated answers, lacks any reported calibration, cross-validation, or ablation against real-world risk data.

    Authors: The weights were derived from accessibility literature and consultation with two low-vision navigation specialists; we will add the exact numerical weights, the four category thresholds, and the rule for handling conflicting answers (majority vote with tie-breaking by highest-risk category) to the revised Methods section. We will also include a sensitivity ablation that varies the weights by ±20 % and shows the resulting change in category distribution across the 20-city dataset. A full calibration against real-world incident data is not feasible with currently available public sources, so we will explicitly note this as a limitation and future work. revision: partial

  3. Referee: [Dataset] No inter-rater agreement metrics (e.g., Cohen's kappa) are reported for the annotations of the 800 images and 18k questions, which is necessary to establish the reliability of the ground truth used for all benchmarking claims.

    Authors: We will compute and report inter-rater agreement in the revised dataset section. The 800 images were labeled for safety category by three independent annotators following a written guideline; we will report Fleiss' kappa on the four-category labels. For the 18k VQA answers, a 10% random sample was double-annotated and we will report Cohen's kappa on that sample. These statistics will be added to Table 1 and the text (a sketch of both computations follows this list). revision: yes
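
The agreement statistics promised in the responses above reduce to standard library calls. The sketch below runs on synthetic placeholder labels, using statsmodels for Fleiss' kappa and sklearn for Cohen's kappa; the real annotations are of course not available here.

```python
# Sketch of the agreement statistics promised in the rebuttal, on synthetic labels.
# The arrays below are random placeholders, not the paper's annotations.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Fleiss' kappa: 800 images x 3 annotators, four safety categories (0..3).
ratings = rng.integers(0, 4, size=(800, 3))           # placeholder annotations
table, _ = aggregate_raters(ratings)                   # subjects x categories counts
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Cohen's kappa: the 10% double-annotated sample of the 18k VQA answers.
annotator_a = rng.integers(0, 2, size=1800)            # placeholder yes/no answers
annotator_b = rng.integers(0, 2, size=1800)
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```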

Circularity Check

0 steps flagged

No circularity: empirical benchmarks on new dataset with off-the-shelf models

full rationale

The paper introduces a hierarchical VQA pipeline and weighted aggregation for risk mapping, then evaluates four standard VLMs on a newly collected 800-image/18k-question dataset using precision and recall. No equations, fitted parameters, or predictions are defined in terms of the target outputs; the aggregation step is presented as a fixed heuristic without reported calibration on the evaluation data. Benchmarks compare public models directly against author annotations, with no self-citation load-bearing on uniqueness or ansatz. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the assumption that current VLMs already contain sufficient scene-understanding knowledge and that a simple weighted sum of their answers can be mapped to meaningful safety categories; no new physical entities are postulated.

free parameters (1)
  • risk weights
    Weights used to aggregate model answers into the final risk score for each street segment; values are not stated in the abstract and must be chosen or fitted.
axioms (1)
  • domain assumption: Off-the-shelf VLMs can produce accurate pedestrian-scene descriptions and hazard identifications across varied urban environments
    Invoked when the authors apply ViLT, LLaVA, InstructBLIP and Qwen-VL without retraining.
invented entities (1)
  • event maps (no independent evidence)
    purpose: Discrete safety-category overlays on street segments for route planning
    New representational construct introduced to turn VQA outputs into navigable risk data.

pith-pipeline@v0.9.0 · 5531 in / 1473 out tokens · 30583 ms · 2026-05-13T05:53:38.800203+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    World blindness and visual impairment: despite many successes, the problem is growing,

    P. Ackland, S. Resnikoff, and R. Bourne, “World blindness and visual impairment: despite many successes, the problem is growing,” Community Eye Health, vol. 30, no. 100, pp. 71–73, 2018

  2. [2]

    Motion planning for autonomous driving: The state of the art and future perspectives,

    S. Teng, X. Hu, P. Deng, B. Li, Y. Li, Y. Ai, D. Yang, L. Li, Z. Xuanyuan, F. Zhu, and L. Chen, “Motion planning for autonomous driving: The state of the art and future perspectives,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 6, pp. 3692–3711, 2023

  3. [3]

    Deep leaning-based ultra-fast stair detection,

    C. Wang, Z. Pei, S. Qiu, and Z. Tang, “Deep leaning-based ultra-fast stair detection,” Scientific Reports, vol. 12, no. 1, p. 16124, 2022

  4. [4]

    Is it safe to cross? interpretable risk assessment with gpt-4v for safety-aware street crossing,

    H. Hwang, S. Kwon, Y. Kim, and D. Kim, “Is it safe to cross? interpretable risk assessment with gpt-4v for safety-aware street crossing,” 2024 21st International Conference on Ubiquitous Robots (UR), pp. 281–288, 2024

  5. [5]

    Visual language integration: A survey and open challenges,

    S.-M. Park and Y.-G. Kim, “Visual language integration: A survey and open challenges,” Computer Science Review, vol. 48, p. 100548, 2023

  6. [6]

    Vqa: Visual question answering,

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015, pp. 2425–2433

  7. [7]

    Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation,

    A. Werby, C. Huang, M. Büchner, A. Valada, and W. Burgard, “Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation,” in Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024

  8. [8]

    Taskography: Evaluating robot task planning over large 3d scene graphs,

    C. Agia, K. M. Jatavallabhula, M. Khodeir, O. Miksik, V. Vineet, M. Mukadam, L. Paull, and F. Shkurti, “Taskography: Evaluating robot task planning over large 3d scene graphs,” in Proceedings of the 5th Conference on Robot Learning, vol. 164, 2022, pp. 46–58

  9. [9]

    Vilt: Vision-and-language transformer without convolution or region supervision,

    W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in ICML, vol. 139, 2021, pp. 5583–5594

  10. [10]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, vol. 36, 2023, pp. 34892–34916

  11. [11]

    Instructblip: towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: towards general-purpose vision-language models with instruction tuning,” in NeurIPS, 2023

  12. [12]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

  13. [13]

    Mapless urban robot navigation by following pedestrians,

    S. Buckeridge, P. Carreno-Medrano, A. Cosgun, E. Croft, and W. P. Chan, “Mapless urban robot navigation by following pedestrians,” in IROS, 2023, pp. 6787–6792

  14. [14]

    Dynamic channel: A planning framework for crowd navigation,

    C. Cao, P. Trautman, and S. Iba, “Dynamic channel: A planning framework for crowd navigation,” in ICRA, 2019, pp. 5551–5557

  15. [15]

    Computer vision and deep learning techniques for pedestrian detection and tracking: A survey,

    A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, “Computer vision and deep learning techniques for pedestrian detection and tracking: A survey,” Neurocomputing, vol. 300, pp. 17–33, 2018

  16. [16]

    Group surfing: A pedestrian-based approach to sidewalk robot navigation,

    Y. Du, N. J. Hetherington, C. L. Oon, W. P. Chan, C. P. Quintero, E. Croft, and H. Machiel Van der Loos, “Group surfing: A pedestrian-based approach to sidewalk robot navigation,” in ICRA, 2019, pp. 6518–6524

  17. [17]

    Analysis of the recent ai for pedestrian navigation with wearable inertial sensors,

    H. Fu, V. Renaudin, Y. Kone, and N. Zhu, “Analysis of the recent ai for pedestrian navigation with wearable inertial sensors,” IEEE Journal of Indoor and Seamless Positioning and Navigation, vol. 1, pp. 26–38, 2023

  18. [18]

    Seamless outdoor-indoor pedestrian positioning system with gnss/uwb/imu fusion: A comparison of ekf, fgo, and pf,

    J. Zhang, X. Yu, S. Ha, P. T. Morón, S. Salimpour, F. Keramat, H. Zhang, and T. Westerlund, “Seamless outdoor-indoor pedestrian positioning system with gnss/uwb/imu fusion: A comparison of ekf, fgo, and pf,” ArXiv, vol. abs/2512.10480, 2025

  19. [19]

    From research to app: Personalized inertial navigation for the visually impaired,

    T. Moisan, H. Fu, V. Renaudin, and M. I. Sayyaf, “From research to app: Personalized inertial navigation for the visually impaired,” in Proceedings of the 2025 International Conference on Indoor Positioning and Indoor Navigation (IPIN). Tampere, Finland: Tampere University, 2025

  20. [20]

    Improving pedestrian navigation in urban environment using augmented reality and landmark recognition,

    D. Kumar, S. Iyer, E. Raja, R. Kumar, and V. P. Kafle, “Improving pedestrian navigation in urban environment using augmented reality and landmark recognition,” IEEE Communications Standards Magazine, vol. 8, no. 1, pp. 20–26, 2024

  21. [21]

    Landmark-based pedestrian navigation from collections of geotagged photos,

    H. Hile, R. Vedantham, G. Cuellar, A. Liu, N. Gelfand, R. Grzeszczuk, and G. Borriello, “Landmark-based pedestrian navigation from collections of geotagged photos,” in Proceedings of the 7th International Conference on Mobile and Ubiquitous Multimedia, Dec. 2008, pp. 145–152

  22. [22]

    Personalized landmark adaptive visualization method for pedestrian navigation maps: Considering user familiarity,

    L. Zhu, J. Shen, J. Zhou, Z. Stachoň, S. Hong, and X. Wang, “Personalized landmark adaptive visualization method for pedestrian navigation maps: Considering user familiarity,” Transactions in GIS, vol. 26, no. 2, pp. 669–690, 2022

  23. [23]

    A Personalised Pedestrian Navigation System,

    U. Shah and J. Wang, “A Personalised Pedestrian Navigation System,” in 12th International Conference on Geographic Information Science, ser. Leibniz International Proceedings in Informatics (LIPIcs), vol. 277, 2023, pp. 67:1–67:6

  24. [24]

    What about people in pedestrian navigation?

    Z. Fang, Q. Li, and S.-L. Shaw, “What about people in pedestrian navigation?” Geo-spatial Information Science, vol. 18, no. 4, pp. 135–150, 2015

  25. [25]

    A system for generating customized pleasant pedestrian routes based on openstreetmap data,

    T. Novack, Z. Wang, and A. Zipf, “A system for generating customized pleasant pedestrian routes based on openstreetmap data,” Sensors, vol. 18, p. 3794, 2018

  26. [26]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139, 2021, pp. 8748–8763

  27. [27]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, vol. 162, 2022, pp. 12888–12900

  28. [28]

    Open-vocabulary object detection upon frozen vision and language models,

    W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova, “Open-vocabulary object detection upon frozen vision and language models,” in ICLR, 2023

  29. [29]

    Lami-detr: Open-vocabulary detection with language model instruction,

    P. Du, Y. Wang, Y. Sun, L. Wang, Y. Liao, G. Zhang, E. Ding, Y. Wang, J. Wang, and S. Liu, “Lami-detr: Open-vocabulary detection with language model instruction,” in ECCV, 2024

  30. [30]

    A survey on open-vocabulary detection and segmentation: Past, present, and future,

    C. Zhu and L. Chen, “A survey on open-vocabulary detection and segmentation: Past, present, and future,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8954–8975, 2024

  31. [31]

    Clip-vg: Self-paced curriculum adapting of clip for visual grounding,

    L. Xiao, X. Yang, F. Peng, M. Yan, Y. Wang, and C. Xu, “Clip-vg: Self-paced curriculum adapting of clip for visual grounding,” IEEE Transactions on Multimedia, vol. 26, pp. 4334–4347, 2024

  32. [32]

    Unleashing text-to-image diffusion models for visual perception,

    W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in ICCV, October 2023, pp. 5729–5739

  33. [33]

    Generating contextually-relevant navigation instructions for blind and low vision people,

    Z. Merchant, A. Anwar, E. H. Wang, S. Chattopadhyay, and J. Thomason, “Generating contextually-relevant navigation instructions for blind and low vision people,” in The 1st InterAI Workshop: Interactive AI for Human-centered Robotics, 2024

  34. [34]

    Vialm: A survey and benchmark of visually impaired assistance with large models,

    Y. Zhao, Y. Zhang, R. Xiang, J. Li, and H. Li, “Vialm: A survey and benchmark of visually impaired assistance with large models,” ArXiv, vol. abs/2402.01735, 2024

  35. [35]

    Be My AI,

    Be My Eyes, “Be My AI,” https://www.bemyeyes.com/

  36. [36]

    Walkvlm: Aid visually impaired people walking by vision language model,

    Z. Yuan, T. Zhang, Y. Zhu, J. Zhang, Y. Deng, Z. Jia, P. Luo, X. Duan, J. Zhou, and J. Zhang, “Walkvlm: Aid visually impaired people walking by vision language model,” in ICCV, October 2025, pp. 9845–9854

  37. [37]

    Vqa-driven event maps for assistive navigation for people with low vision in urban environments,

    J. Morales, B. Gebregziabher, A. Cabañeros, and J. Sanchez-Riera, “Vqa-driven event maps for assistive navigation for people with low vision in urban environments,” in ICRA, 2025, pp. 12458–12464

  38. [38]

    Vizwiz grand challenge: Answering visual questions from blind people,

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand challenge: Answering visual questions from blind people,” in CVPR, 2018

  39. [39]

    Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people,

    D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, and J. P. Bigham, “Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people,” in CVPR, 2019, pp. 939–948

  40. [40]

    Guidedog: A real-world egocentric multimodal dataset for blind and low-vision accessibility-aware guidance,

    J. Kim, J. Park, J. Park, S. Lee, J. Chung, J. Kim, J. H. Joung, and Y. Yu, “Guidedog: A real-world egocentric multimodal dataset for blind and low-vision accessibility-aware guidance,” 2025

  41. [41]

    3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans,

    A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” in Robotics: Science and Systems (RSS), 2020

  42. [42]

    Optimal scene graph planning with large language model guidance,

    Z. Dai, A. Asgharivaskasi, T. Duong, S. Lin, M.-E. Tzes, G. Pappas, and N. Atanasov, “Optimal scene graph planning with large language model guidance,” in ICRA, 2024, pp. 14062–14069

  43. [43]

    Bird’s-eye-view scene graph for vision-language navigation,

    R. Liu, X. Wang, W. Wang, and Y. Yang, “Bird’s-eye-view scene graph for vision-language navigation,” in ICCV, October 2023, pp. 10968–10980

  44. [44]

    Long-term object search using incremental scene graph updating,

    F. Zhou, H. Liu, H. Zhao, and L. Liang, “Long-term object search using incremental scene graph updating,” Robotica, vol. 41, no. 3, pp. 962–975, 2023