arXiv: 2408.10872 · v5 · submitted 2024-08-20 · cs.CV · cs.AI · cs.ET

V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?

Pith reviewed 2026-05-23 22:03 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.ET
keywords: vision-language models, road safety assessment, iRAP standard, zero-shot VQA, street-level images, infrastructure risk, automatic mapping

The pith

Vision-language models can classify iRAP road safety attributes from single street-level images without task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision-language models can act as zero-shot assessors of road infrastructure risk by answering questions about iRAP-defined attributes such as lane markings, barriers, and pedestrian facilities. It releases the first open dataset of over 2,000 annotated Thai street-level images and compares Gemini-1.5-flash and GPT-4o-mini against supervised CNN baselines. The models show weaker spatial reasoning but generalize better to unseen attribute classes through prompt adjustments alone. This matters in low- and middle-income countries, where most roads lack any safety rating because expert annotation is expensive and slow. The central finding is that VLMs become practical tools once their outputs are combined with other data sources rather than used in isolation.

Core claim

Vision-language models can perform zero-shot visual question answering to classify road safety attributes according to the iRAP standard, generalizing to new classes without retraining while underperforming on spatial tasks, as shown on the new ThaiRAP image dataset.

What carries the argument

The V-RoAst zero-shot VQA framework that converts iRAP attribute definitions into natural-language questions posed to VLMs on individual street-level images.
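
As a rough illustration of this conversion, the sketch below turns a small table of attribute definitions into one question prompt and then filters a JSON answer. The attribute names, category codes, prompt wording, and the `build_prompt` and `parse_answer` helpers are hypothetical; they are not taken from the paper or from the iRAP coding manual, and the call to an actual VLM is deliberately left out.

```python
import json

# Hypothetical attribute table: names and codes are illustrative only,
# not the real iRAP coding scheme.
ATTRIBUTES = {
    "pedestrian_crossing": {1: "present", 2: "absent"},
    "roadside_barrier": {1: "present", 2: "absent"},
}


def build_prompt(attributes):
    """Render attribute definitions as one natural-language question."""
    lines = ["For the attached street-level image, choose one code per attribute."]
    for name, codes in attributes.items():
        options = ", ".join(f"{code} = {label}" for code, label in sorted(codes.items()))
        lines.append(f"- {name}: {options}")
    lines.append("Reply with a JSON object mapping attribute name to code.")
    return "\n".join(lines)


def parse_answer(text, attributes):
    """Keep only answers whose attribute and code are both defined."""
    raw = json.loads(text)
    return {
        name: int(code)
        for name, code in raw.items()
        if name in attributes and int(code) in attributes[name]
    }


prompt = build_prompt(ATTRIBUTES)
# Code 9 is not defined for roadside_barrier, so that answer is discarded.
answer = parse_answer('{"pedestrian_crossing": 2, "roadside_barrier": 9}', ATTRIBUTES)
```

Because the classes live in the table rather than in trained weights, adding or renaming a class in this sketch only means editing `ATTRIBUTES`.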

If this is right

  • Road safety ratings can be generated automatically for previously unassessed roads without collecting new labeled training data.
  • Prompt changes allow the same model to handle new regions or updated iRAP criteria without retraining.
  • Integration with complementary data sources can compensate for the models' spatial weaknesses.
  • Low-cost mapping of infrastructure risks becomes feasible in areas that lack expert assessors.
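
One way to picture the integration point in the list above is a per-attribute override: where a complementary source supplies a value for a spatially demanding attribute, that value replaces the VLM answer. This is a sketch under stated assumptions. The attribute names, the `SPATIAL_ATTRIBUTES` set, and the `fuse` helper are hypothetical, and the referee report on this page notes that the abstract describes no integration mechanism.

```python
# Hypothetical set of attributes assumed to need spatial measurement.
SPATIAL_ATTRIBUTES = {"lane_width", "curvature"}


def fuse(vlm_answers, external_values):
    """Prefer external values for spatial attributes, VLM answers otherwise."""
    fused = {}
    for name, vlm_value in vlm_answers.items():
        if name in SPATIAL_ATTRIBUTES and name in external_values:
            fused[name] = (external_values[name], "external")
        else:
            fused[name] = (vlm_value, "vlm")
    return fused


fused = fuse(
    {"lane_width": "narrow", "pedestrian_crossing": "absent"},
    {"lane_width": "medium"},
)
```

Each fused value carries its source, so a downstream rating step can tell which attributes still rest on the image alone.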

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing single-image VLM outputs with map layers or multi-view captures could address the documented spatial shortfalls.
  • The method might extend to other infrastructure rating systems beyond iRAP by swapping the question set.
  • Selective human review triggered only on low-confidence VLM responses could balance cost and reliability in practice.
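
The selective-review idea could be sketched as follows. The confidence signal is an assumption: agreement among repeated VLM answers to the same question, which is one possible proxy and not something the paper reports. The `triage` helper and the 0.8 threshold are hypothetical.

```python
from collections import Counter


def triage(samples, threshold=0.8):
    """Split attributes into auto-accepted answers and ones needing review.

    samples maps attribute name -> non-empty list of repeated answers.
    """
    accepted, review = {}, []
    for name, answers in samples.items():
        label, votes = Counter(answers).most_common(1)[0]
        agreement = votes / len(answers)
        if agreement >= threshold:
            accepted[name] = label
        else:
            review.append(name)
    return accepted, review


accepted, review = triage({
    "roadside_barrier": ["present"] * 5,
    "pedestrian_crossing": ["absent", "present", "absent", "present", "absent"],
})
```

In this toy input the barrier answer is unanimous and is accepted, while the crossing answer reaches only 3 of 5 votes and is queued for a human assessor.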

Load-bearing premise

A single street-level image holds enough visual detail for reliable iRAP attribute classification, and VLM answers can be used without extra human checks or sensors.

What would settle it

A set of expert-verified images where the VLM consistently fails to detect the presence or absence of a specific iRAP feature such as guardrails or pedestrian crossings.

Figures

Figures reproduced from arXiv: 2408.10872 by Huanfa Chen, Ilya Ilyankou, James Haworth, June Moh Goo, Kerkritt Sriroongvikrai, Meihui Wang, Natchapon Jongwiriyanurak, Nicola Christie, Xinglei Wang, Zichao Zeng.

Figure 1: Locations of the ThaiRAP dataset and an example of …
Figure 2: V-RoAst Dataset Annotation Process
  Adjacent paper text captured with this caption: "… tasks can benefit from the reasoning capabilities of existing VLMs. This setup also opens possibilities for prompt engineering, RAG, and lightweight fine-tuning to enhance performance. 3. Dataset Construction. 3.1. Data Collection. We provide a real-world iRAP-compliant road assessment dataset comprising 2,037 street-level images (1600×1200 pixels) captured across Bangkok,…"
Figure 3: Code distribution: the numbers at the top indicate the unique codes (representing all possible codes). The following abbreviations …
Figure 4: Framework of V-RoAst for Visual Road Assessment
Figure 5: System Prompt from Figure …
Figure 7: Star rating (motorcyclists) confusion matrix of us…

Original abstract

Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce V-RoAst, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces V-RoAst, a zero-shot VQA framework using VLMs (Gemini-1.5-flash and GPT-4o-mini) to classify iRAP road safety attributes from street-level images. It releases the first open-source ThaiRAP dataset (>2,000 curated images from Thailand) and benchmarks the VLMs against VGGNet and ResNet baselines. The authors observe that VLMs underperform on spatial awareness yet generalize to unseen classes, and conclude that VLMs can serve as automatic road assessment tools when integrated with complementary data; the work is positioned as the first exploration of VLMs for zero-shot infrastructure risk assessment.

Significance. If supported by quantitative evidence, the work could be significant for enabling low-cost, scalable road safety mapping in LMICs where most roads remain unrated. The release of the ThaiRAP dataset provides a reusable resource, and the zero-shot prompt-based approach avoids the region-specific retraining required by supervised baselines. The qualified claim (integration with complementary data) could open practical directions if the integration mechanism and error compensation are demonstrated.

major comments (3)
  1. [Abstract] Performance claims are stated only qualitatively (VLMs 'underperform on spatial awareness' yet 'generalise well to unseen classes') with no per-attribute accuracy, precision/recall, confusion matrices, or numeric comparison to the VGGNet/ResNet baselines. Without these metrics it is impossible to assess whether the observed generalization is sufficient for the central claim that VLMs can serve as road assessment tools.
  2. [Abstract, final paragraph] The headline result that 'VLMs can serve as automatic road assessment tools when integrated with complementary data' is not accompanied by any description of the integration mechanism, any quantitative evidence that complementary data compensates for spatial errors, or any evaluation of the combined system. Many iRAP attributes (curvature, lane width, roadside hazards, sight distance) are inherently spatial, so the noted spatial-awareness failures remain load-bearing.
  3. [Abstract] The assumption that single street-level images contain sufficient visual information for reliable iRAP attribute classification is stated without supporting evidence or discussion of failure modes; the paper does not address how VLM responses would be trusted for safety-critical decisions without human verification or additional sensors.
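
For concreteness, the metrics requested in comment 1 can be computed from paired label lists as in the sketch below. The labels are toy values for a single attribute, not results from the paper, and the helper names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count (true, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class; 0.0 when undefined."""
    cm = confusion_matrix(y_true, y_pred)
    tp = cm[(positive, positive)]
    fp = sum(n for (t, p), n in cm.items() if p == positive and t != positive)
    fn = sum(n for (t, p), n in cm.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels for a single attribute (illustrative only).
y_true = ["present", "present", "absent", "absent", "present"]
y_pred = ["present", "absent", "absent", "present", "present"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred, "present")
```

Reporting these per attribute, rather than as one pooled score, would show whether the spatial attributes are the ones dragging performance down.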

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas where the abstract can be strengthened with quantitative details and clearer qualifications. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Performance claims are stated only qualitatively (VLMs 'underperform on spatial awareness' yet 'generalise well to unseen classes') with no per-attribute accuracy, precision/recall, confusion matrices, or numeric comparison to the VGGNet/ResNet baselines. Without these metrics it is impossible to assess whether the observed generalization is sufficient for the central claim that VLMs can serve as road assessment tools.

    Authors: We agree that the abstract should include quantitative metrics to support the claims. The full manuscript (Section 4 and supplementary material) reports per-attribute accuracies, precision, recall, and direct numeric comparisons to VGGNet and ResNet baselines, including confusion matrices for key attributes. We will revise the abstract to incorporate key numeric results (e.g., overall accuracy figures and generalization gaps) while retaining the qualitative summary of spatial vs. generalization performance. Revision: yes.

  2. Referee: [Abstract, final paragraph] The headline result that 'VLMs can serve as automatic road assessment tools when integrated with complementary data' is not accompanied by any description of the integration mechanism, any quantitative evidence that complementary data compensates for spatial errors, or any evaluation of the combined system. Many iRAP attributes (curvature, lane width, roadside hazards, sight distance) are inherently spatial, so the noted spatial-awareness failures remain load-bearing.

    Authors: The manuscript presents the integration statement as a forward-looking qualified claim rather than a result demonstrated in this work, which focuses on zero-shot VLM evaluation. We acknowledge the abstract phrasing implies more support than is provided. We will revise the final paragraph to explicitly state that demonstrating integration mechanisms and error compensation with complementary data (e.g., for spatial attributes) is proposed as future work, removing any implication of current evidence. Revision: yes.

  3. Referee: [Abstract] The assumption that single street-level images contain sufficient visual information for reliable iRAP attribute classification is stated without supporting evidence or discussion of failure modes; the paper does not address how VLM responses would be trusted for safety-critical decisions without human verification or additional sensors.

    Authors: The approach follows the iRAP standard's reliance on street-level imagery, but we agree the abstract lacks explicit discussion of limitations. We will add a limitations section to the revised manuscript that covers failure modes (including spatial reasoning), notes that single-image inputs may be insufficient for certain attributes, and emphasizes that VLM outputs are intended as assistive tools requiring human oversight and potential multi-sensor validation for safety-critical use. Revision: yes.

Circularity Check

0 steps flagged

Empirical evaluation study with no derivation chain or self-referential claims

The paper introduces a new dataset (ThaiRAP) and performs direct empirical evaluation of VLMs (Gemini-1.5-flash, GPT-4o-mini) against CNN baselines (VGGNet, ResNet) for zero-shot iRAP attribute classification. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the central claims. Results are reported from explicit experiments on the held-out test set; the qualified conclusion that VLMs 'can serve as automatic road assessment tools when integrated with complementary data' rests on observed metrics rather than any definitional or self-citation reduction. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that current VLMs possess sufficient visual reasoning to map image content to iRAP safety attributes in a zero-shot setting and that single images are adequate input.

axioms (2)
  • Domain assumption: VLMs can perform zero-shot visual question answering on road scene images without domain-specific fine-tuning.
    Invoked throughout the abstract as the core method.
  • Domain assumption: iRAP safety attributes are visually discernible from single street-level photographs.
    Implicit in the choice of input data and task definition.

pith-pipeline@v0.9.0 · 5802 in / 1356 out tokens · 34865 ms · 2026-05-23T22:03:37.018331+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images

    cs.CV · 2025-06 · unverdicted · novelty 4.0

    A lightweight multi-modal CLIP pipeline predicts exact-match geographical tags on a Kaggle subset of the Geograph crowdsourced image archive by fusing image, location, and title embeddings.
