V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?

Huanfa Chen; Ilya Ilyankou; James Haworth; June Moh Goo; Kerkritt Sriroongvikrai; Meihui Wang; Natchapon Jongwiriyanurak; Nicola Christie; Xinglei Wang; Zichao Zeng

arXiv: 2408.10872 · v5 · submitted 2024-08-20 · 💻 cs.CV · cs.AI · cs.ET

Pith reviewed 2026-05-23 22:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.ET

keywords vision-language models · road safety assessment · iRAP standard · zero-shot VQA · street-level images · infrastructure risk · automatic mapping

The pith

Vision-language models can classify iRAP road safety attributes from single street images without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision-language models can act as zero-shot assessors for road infrastructure risks by answering questions about iRAP-defined attributes such as lane markings, barriers, and pedestrian facilities. It releases the first open dataset of over 2000 annotated Thai street-level images and compares Gemini-1.5-flash and GPT-4o-mini against supervised CNN baselines. The models show weaker spatial reasoning yet better generalization to unseen attribute classes through prompt adjustments alone. This matters in low- and middle-income countries where most roads lack any safety rating because expert annotation is expensive and slow. The central finding is that VLMs become practical tools once their outputs are combined with other data sources rather than used in isolation.

Core claim

Vision-language models can perform zero-shot visual question answering to classify road safety attributes according to the iRAP standard, generalizing to new classes without retraining while underperforming on spatial tasks, as shown on the new ThaiRAP image dataset.

What carries the argument

The V-RoAst zero-shot VQA framework that converts iRAP attribute definitions into natural-language questions posed to VLMs on individual street-level images.
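
As a rough illustration of that conversion, the sketch below turns an attribute definition into a closed-set question and maps the free-text reply back onto an allowed class. It is a minimal sketch under stated assumptions, not the paper's prompt: the attribute names, class labels, question wording, and the `ask_vlm` callable are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Hypothetical attribute schema: the names and class labels are illustrative
# only and are not taken from the iRAP coding manual or the paper's prompt.
ATTRIBUTES: Dict[str, List[str]] = {
    "pedestrian_crossing": ["none", "unsignalised", "signalised"],
    "roadside_barrier": ["absent", "present"],
}


def build_question(attribute: str, classes: List[str]) -> str:
    """Turn one attribute definition into a closed-set question."""
    options = ", ".join(classes)
    return (
        f"Look at the street-level image. For the attribute '{attribute}', "
        f"answer with exactly one of: {options}."
    )


def parse_answer(raw: str, classes: List[str]) -> str:
    """Map a free-text reply onto an allowed class, or 'unknown' if it does not match."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in classes else "unknown"


def assess_image(image_path: str, ask_vlm: Callable[[str, str], str]) -> Dict[str, str]:
    """Ask one question per attribute.

    ask_vlm(image_path, question) is a caller-supplied function (an assumption
    of this sketch) that sends the image and question to a VLM and returns text.
    """
    return {
        name: parse_answer(ask_vlm(image_path, build_question(name, classes)), classes)
        for name, classes in ATTRIBUTES.items()
    }
```

Under this design, supporting a new class only means editing the class list for an attribute, which is one way to read the claim that prompt changes stand in for retraining.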

If this is right

Road safety ratings can be generated automatically for previously unassessed roads without collecting new labeled training data.
Prompt changes allow the same model to handle new regions or updated iRAP criteria without retraining.
Integration with complementary data sources can compensate for the models' spatial weaknesses.
Low-cost mapping of infrastructure risks becomes feasible in areas that lack expert assessors.
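
The paper does not describe an integration mechanism, so the following is only one possible reading of "integration with complementary data sources": prefer an external value (for example from a map layer) for attributes treated as spatial, and fall back to the VLM answer otherwise. The attribute names, the `SPATIAL_ATTRIBUTES` set, and the `fuse` function are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical set of attributes this sketch treats as spatial.
SPATIAL_ATTRIBUTES: Set[str] = {"lane_width", "curvature"}


def fuse(
    vlm: Dict[str, str],
    external: Dict[str, Optional[str]],
) -> Dict[str, Dict[str, str]]:
    """Combine VLM answers with external data, recording where each value came from.

    For spatial attributes an external value wins when one is available;
    every other attribute keeps the VLM answer.
    """
    fused: Dict[str, Dict[str, str]] = {}
    for name, vlm_value in vlm.items():
        ext_value = external.get(name)
        if name in SPATIAL_ATTRIBUTES and ext_value is not None:
            fused[name] = {"value": ext_value, "source": "external"}
        else:
            fused[name] = {"value": vlm_value, "source": "vlm"}
    return fused
```

Keeping the `source` field makes it possible to audit later how much of a rating actually rests on the VLM.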

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pairing single-image VLM outputs with map layers or multi-view captures could address the documented spatial shortfalls.
The method might extend to other infrastructure rating systems beyond iRAP by swapping the question set.
Selective human review triggered only on low-confidence VLM responses could balance cost and reliability in practice.
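
One way the selective-review idea could be sketched: hosted VLMs do not necessarily expose a calibrated confidence, so this example assumes the same question is asked several times and uses the agreement rate among the answers as a stand-in for confidence. The `vote` and `needs_review` helpers and the 0.8 threshold are illustrative assumptions, not part of the paper.

```python
from collections import Counter
from typing import List, Tuple


def vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and its share of all answers."""
    if not answers:
        raise ValueError("need at least one answer")
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def needs_review(answers: List[str], threshold: float = 0.8) -> bool:
    """Flag an image/attribute pair for human review when agreement is below the threshold."""
    _, agreement = vote(answers)
    return agreement < threshold
```

For example, `needs_review(["present", "present", "absent", "present"])` flags the item, because only three of the four answers agree.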

Load-bearing premise

A single street-level image holds enough visual detail for reliable iRAP attribute classification, and VLM answers can be used without extra human checks or sensors.

What would settle it

A set of expert-verified images where the VLM consistently fails to detect the presence or absence of a specific iRAP feature such as guardrails or pedestrian crossings.
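
A test of this kind could be scored roughly as follows. The sketch assumes each image carries an expert label and a VLM prediction for the presence of one feature, and reports recall (how often a present feature is detected) and specificity (how often an absent feature is correctly reported absent). The function name and the boolean encoding are assumptions made for illustration.

```python
from typing import Dict, List


def presence_metrics(expert: List[bool], predicted: List[bool]) -> Dict[str, float]:
    """Score presence/absence detection of one feature against expert labels."""
    if len(expert) != len(predicted):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(expert, predicted))
    tp = sum(e and p for e, p in pairs)          # present, detected
    fn = sum(e and not p for e, p in pairs)      # present, missed
    tn = sum(not e and not p for e, p in pairs)  # absent, reported absent
    fp = sum(not e and p for e, p in pairs)      # absent, reported present
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"recall": recall, "specificity": specificity}
```

A consistently low recall or specificity for a feature such as guardrails, measured on expert-verified images, would be the kind of evidence described above.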

Figures

Figures reproduced from arXiv: 2408.10872 by Huanfa Chen, Ilya Ilyankou, James Haworth, June Moh Goo, Kerkritt Sriroongvikrai, Meihui Wang, Natchapon Jongwiriyanurak, Nicola Christie, Xinglei Wang, Zichao Zeng.

**Figure 2.** V-RoAst Dataset Annotation Process. Text captured alongside the caption: "…tasks can benefit from the reasoning capabilities of existing VLMs. This setup also opens possibilities for prompt engineering, RAG, and lightweight fine-tuning to enhance performance. 3. Dataset Construction. 3.1. Data Collection. We provide a real-world iRAP-compliant road assessment dataset comprising 2,037 street-level images (1600×1200 pixels) captured across Bangkok,…"

**Figure 3.** Code distribution: the numbers at the top indicate the unique codes (representing all possible codes). The following abbreviations…

**Figure 4.** Framework of V-RoAst for Visual Road Assessment.

**Figure 5.** System Prompt from Figure…

**Figure 7.** Star rating (motorcyclists) confusion matrix of us…

Original abstract

Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce V-RoAst, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Releases a new open ThaiRAP dataset and tests zero-shot VLMs on iRAP attributes, but gives no numbers so the practical value stays unclear.

The main things to know are that the authors put out the first public ThaiRAP dataset of over 2,000 street-level images annotated for iRAP road safety attributes and ran zero-shot tests with Gemini-1.5-flash and GPT-4o-mini against simple CNN baselines. They note the VLMs pick up unseen classes reasonably but struggle with spatial tasks, and they conclude the models could work as road assessors if paired with other data sources. Code and data are shared on GitHub, which is straightforward and helpful for follow-up work.

That dataset release is the clearest positive step here, since labeled road imagery from LMICs is hard to come by and iRAP assessments are expensive to run manually. The zero-shot angle also fits the setting where retraining on local data is impractical.

The soft spots sit right in the middle of the claims. The abstract and stress-test note both flag the lack of any numeric results: no per-attribute accuracies, no error bars, no breakdown on which iRAP items (curvature, sight distance, hazards) actually succeed or fail. Without those figures it is difficult to judge whether the generalization outweighs the admitted spatial weaknesses, and many core iRAP attributes are spatial. The qualified claim that VLMs work “when integrated with complementary data” is left without any description of the integration or evidence that it would fix the gaps. This is an empirical exploration paper, not a methods advance, so the missing metrics are the main limitation rather than a side issue.

Readers working on VLM applications for infrastructure or on road safety data collection in Southeast Asia or similar regions would find the dataset and the basic prompting setup useful to build from. The work shows clear engagement with the iRAP standard and prior VLM literature, so it is coherent on its own terms.

I would send it to peer review so the authors can add the quantitative results and test whether the spatial problems are fixable with the complementary data they mention.

Referee Report

3 major / 0 minor

Summary. The paper introduces V-RoAst, a zero-shot VQA framework using VLMs (Gemini-1.5-flash and GPT-4o-mini) to classify iRAP road safety attributes from street-level images. It releases the first open-source ThaiRAP dataset (>2,000 curated images from Thailand) and benchmarks the VLMs against VGGNet and ResNet baselines. The authors observe that VLMs underperform on spatial awareness yet generalize to unseen classes, and conclude that VLMs can serve as automatic road assessment tools when integrated with complementary data; the work is positioned as the first exploration of VLMs for zero-shot infrastructure risk assessment.

Significance. If supported by quantitative evidence, the work could be significant for enabling low-cost, scalable road safety mapping in LMICs where most roads remain unrated. The release of the ThaiRAP dataset provides a reusable resource, and the zero-shot prompt-based approach avoids the region-specific retraining required by supervised baselines. The qualified claim (integration with complementary data) could open practical directions if the integration mechanism and error compensation are demonstrated.

major comments (3)

[Abstract] Abstract: performance claims are stated only qualitatively (VLMs 'underperform on spatial awareness' yet 'generalise well to unseen classes') with no per-attribute accuracy, precision/recall, confusion matrices, or numeric comparison to the VGGNet/ResNet baselines. Without these metrics it is impossible to assess whether the observed generalization is sufficient for the central claim that VLMs can serve as road assessment tools.
[Abstract] Abstract (final paragraph): the headline result that 'VLMs can serve as automatic road assessment tools when integrated with complementary data' is not accompanied by any description of the integration mechanism, any quantitative evidence that complementary data compensates for spatial errors, or any evaluation of the combined system. Many iRAP attributes (curvature, lane width, roadside hazards, sight distance) are inherently spatial, so the noted spatial-awareness failures remain load-bearing.
[Abstract] Abstract: the assumption that single street-level images contain sufficient visual information for reliable iRAP attribute classification is stated without supporting evidence or discussion of failure modes; the paper does not address how VLM responses would be trusted for safety-critical decisions without human verification or additional sensors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas where the abstract can be strengthened with quantitative details and clearer qualifications. We address each major comment below and will incorporate revisions to improve the manuscript.

Referee: [Abstract] Abstract: performance claims are stated only qualitatively (VLMs 'underperform on spatial awareness' yet 'generalise well to unseen classes') with no per-attribute accuracy, precision/recall, confusion matrices, or numeric comparison to the VGGNet/ResNet baselines. Without these metrics it is impossible to assess whether the observed generalization is sufficient for the central claim that VLMs can serve as road assessment tools.

Authors: We agree that the abstract should include quantitative metrics to support the claims. The full manuscript (Section 4 and supplementary material) reports per-attribute accuracies, precision, recall, and direct numeric comparisons to VGGNet and ResNet baselines, including confusion matrices for key attributes. We will revise the abstract to incorporate key numeric results (e.g., overall accuracy figures and generalization gaps) while retaining the qualitative summary of spatial vs. generalization performance.

revision: yes

Referee: [Abstract] Abstract (final paragraph): the headline result that 'VLMs can serve as automatic road assessment tools when integrated with complementary data' is not accompanied by any description of the integration mechanism, any quantitative evidence that complementary data compensates for spatial errors, or any evaluation of the combined system. Many iRAP attributes (curvature, lane width, roadside hazards, sight distance) are inherently spatial, so the noted spatial-awareness failures remain load-bearing.

Authors: The manuscript presents the integration statement as a forward-looking qualified claim rather than a result demonstrated in this work, which focuses on zero-shot VLM evaluation. We acknowledge the abstract phrasing implies more support than is provided. We will revise the final paragraph to explicitly state that demonstrating integration mechanisms and error compensation with complementary data (e.g., for spatial attributes) is proposed as future work, removing any implication of current evidence.

revision: yes

Referee: [Abstract] Abstract: the assumption that single street-level images contain sufficient visual information for reliable iRAP attribute classification is stated without supporting evidence or discussion of failure modes; the paper does not address how VLM responses would be trusted for safety-critical decisions without human verification or additional sensors.

Authors: The approach follows the iRAP standard's reliance on street-level imagery, but we agree the abstract lacks explicit discussion of limitations. We will add a limitations section to the revised manuscript that covers failure modes (including spatial reasoning), notes that single-image inputs may be insufficient for certain attributes, and emphasizes that VLM outputs are intended as assistive tools requiring human oversight and potential multi-sensor validation for safety-critical use.

revision: yes

Circularity Check

0 steps flagged

Empirical evaluation study with no derivation chain or self-referential claims

The paper introduces a new dataset (ThaiRAP) and performs direct empirical evaluation of VLMs (Gemini-1.5-flash, GPT-4o-mini) against CNN baselines (VGGNet, ResNet) for zero-shot iRAP attribute classification. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the central claims. Results are reported from explicit experiments on the held-out test set; the qualified conclusion that VLMs 'can serve as automatic road assessment tools when integrated with complementary data' rests on observed metrics rather than any definitional or self-citation reduction. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that current VLMs possess sufficient visual reasoning to map image content to iRAP safety attributes in a zero-shot setting and that single images are adequate input.

axioms (2)

domain assumption: VLMs can perform zero-shot visual question answering on road scene images without domain-specific fine-tuning.
Invoked throughout the abstract as the core method.

domain assumption: iRAP safety attributes are visually discernible from single street-level photographs.
Implicit in the choice of input data and task definition.

pith-pipeline@v0.9.0 · 5802 in / 1356 out tokens · 34865 ms · 2026-05-23T22:03:37.018331+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images
cs.CV 2025-06 unverdicted novelty 4.0

A lightweight multi-modal CLIP pipeline predicts exact-match geographical tags on a Kaggle subset of the Geograph crowdsourced image archive by fusing image, location, and title embeddings.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1]

Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-Of-The-Art Review

Abolfazl Abdollahi, Biswajeet Pradhan, Nagesh Shukla, Subrata Chakraborty, and Abdullah Alamri. Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-Of-The-Art Review. Remote Sensing , 12(9):1444, 2020. 2

work page 2020
[2]

VQA: Visual Question Answering

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Mar- garet Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual Question Answering, 2016. arXiv:1505.00468 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

RDD2022: A multi- national image dataset for automatic road damage detection

Deeksha Arya, Hiroya Maeda, Sanjay Kumar Ghosh, Durga Toshniwal, and Yoshihide Sekimoto. RDD2022: A multi- national image dataset for automatic road damage detection. Geoscience Data Journal, page gdj3.260, 2024. 2

work page 2024
[4]

An Analytical Framework for Accurate Traffic Flow Param- eter Calculation from UA V Aerial Videos

Ivan Brki ´c, Mario Miler, MarkoˇSevrovi´c, and Damir Medak. An Analytical Framework for Accurate Traffic Flow Param- eter Calculation from UA V Aerial Videos. Remote Sensing, 12(22):3844, 2020. 2

work page 2020
[5]

Automatic Roadside Feature Detection Based on Lidar Road Cross Section Images

Ivan Brki ´c, Mario Miler, MarkoˇSevrovi´c, and Damir Medak. Automatic Roadside Feature Detection Based on Lidar Road Cross Section Images. Sensors, 22(15):5510, 2022. 2

work page 2022
[6]

Utilizing High Resolution Satellite Imagery for Automated Road Infrastructure Safety Assessments

Ivan Brki ´c, Marko ˇSevrovi´c, Damir Medak, and Mario Miler. Utilizing High Resolution Satellite Imagery for Automated Road Infrastructure Safety Assessments. Sensors, 23(9): 4405, 2023. 2

work page 2023
[7]

The global macroeconomic burden of road injuries: estimates and projections for 166 countries

Simiao Chen, Michael Kuhn, Klaus Prettner, and David E Bloom. The global macroeconomic burden of road injuries: estimates and projections for 166 countries. The Lancet Planetary Health, 3(9):e390–e398, 2019. 1

work page 2019
[8]

MapEval: A Map-Based Evaluation of Geo-Spatial Reason- ing in Foundation Models, 2025

Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. MapEval: A Map-Based Evaluation of Geo-Spatial Reason- ing in Foundation Models, 2025. arXiv:2501.00316 [cs]. 2

work page arXiv 2025
[9]

Vision meets robotics: The KITTI dataset

A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013. 2

work page 2013
[10]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team Gemini. Gemini 1.5: Unlocking multimodal un- derstanding across millions of tokens of context, 2024. arXiv:2403.05530 [cs]. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Zero-Shot Detection of Buildings in Mobile LiDAR using Language Vision Model

June Moh Goo, Zichao Zeng, and Jan Boehm. Zero-Shot Detection of Buildings in Mobile LiDAR using Language Vision Model. The International Archives of the Photogram- metry, Remote Sensing and Spatial Information Sciences , XLVIII-2-2024:107–113, 2024. 2

work page 2024
[12]

Hybrid-Segmentor: Hybrid approach for automated fine-grained crack segmentation in civil infrastructure

June Moh Goo, Xenios Milidonis, Alessandro Artusi, Jan Boehm, and Carlo Ciliberto. Hybrid-Segmentor: Hybrid approach for automated fine-grained crack segmentation in civil infrastructure. Automation in Construction , 170: 105960, 2025. 2

work page 2025
[13]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015. arXiv:1512.03385 [cs]. 6

work page internal anchor Pith review Pith/arXiv arXiv 2015
[14]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Dan Hendrycks and Kevin Gimpel. A Baseline for Detect- ing Misclassified and Out-of-Distribution Examples in Neu- ral Networks, 2018. arXiv:1610.02136 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring Massive Multitask Language Understanding, 2021. arXiv:2009.03300 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Global Streetscapes — A comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics

Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. Global Streetscapes — A comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics. ISPRS Journal of Photogram- metry and Remote Sensing, 215:216–238, 2024. 8

work page 2024
[17]

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

Drew A. Hudson and Christopher D. Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019. arXiv:1902.09506 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2019
[18]

Automated pavement distress detection using region based convolutional neural networks

Eldor Ibragimov, Hyun-Jong Lee, Jong-Jae Lee, and Nam- gyu Kim. Automated pavement distress detection using region based convolutional neural networks. International Journal of Pavement Engineering , 23(6):1981–1992, 2022. 2

work page 1981
[19]

CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images

Ilya Ilyankou, Natchapon Jongwiriyanurak, Tao Cheng, and James Haworth. CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images, 2025. arXiv:2506.12214 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Lynn Abbott, and Abhijit Sarkar

Sandesh Jain, Surendrabikram Thapa, Kuan-Ting Chen, A. Lynn Abbott, and Abhijit Sarkar. Semantic Understand- ing of Traffic Scenes with Large Vision Language Models. In 2024 IEEE Intelligent Vehicles Symposium (IV), pages 1580– 1587, Jeju Island, Korea, Republic of, 2024. IEEE. 2

work page 2024
[21]

A Convolutional Neural Network Based Deep Learning Technique for Identifying Road Attributes

Zohaib Jan, Brijesh Verma, Joseph Affum, Sam Atabak, and Lachlan Moir. A Convolutional Neural Network Based Deep Learning Technique for Identifying Road Attributes. In2018 International Conference on Image and Vision Computing New Zealand (IVCNZ), pages 1–6, Auckland, New Zealand,

work page
[22]

Framework for Motorcycle Risk Assessment Using Onboard Panoramic Camera

Natchapon Jongwiriyanurak, Zichao Zeng, Meihui Wang, James Haworth, Garavig Tanaksaranond, and Jan Boehm. Framework for Motorcycle Risk Assessment Using Onboard Panoramic Camera. In 12th International Conference on Ge- ographic Information Science (GIScience 2023), 2023. 2

work page 2023
[23]

Multi-Task Learning for iRAP Attribute Classi- fication and Road Safety Assessment

Marin Kacan, Marin Orsic, Sinisa Segvic, and Marko Sevrovic. Multi-Task Learning for iRAP Attribute Classi- fication and Road Safety Assessment. In 2020 IEEE 23rd International Conference on Intelligent Transportation Sys- tems (ITSC), pages 1–6, Rhodes, Greece, 2020. IEEE. 2, 6

work page 2020
[24]

Dynamic Loss Balancing and Sequential Enhancement for Road- Safety Assessment and Traffic Scene Classification

Marin Ka ˇcan, Marko ˇSevrovi´c, and Siniˇsa ˇSegvi´c. Dynamic Loss Balancing and Sequential Enhancement for Road- Safety Assessment and Traffic Scene Classification. IEEE Transactions on Intelligent Transportation Systems, 25(11): 15628–15640, 2024. 2, 6

work page 2024
[25]

Multi- Target Domain Adaptation with Class-Wise Attribute Trans- fer in Semantic Segmentation

Changjae Kim, Seunghun Lee, and Sunghoon Im. Multi- Target Domain Adaptation with Class-Wise Attribute Trans- fer in Semantic Segmentation. In BMVC, 2023. 2

work page 2023
[26]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, Se- bastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2021. arXiv:2005.11401 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2021
[27]

OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery, 2025

Xiucheng Liang, Jinheng Xie, Tianhong Zhao, Rudi Stouffs, and Filip Biljecki. OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery, 2025. arXiv:2504.02866 [cs]. 2

work page arXiv 2025
[28]

DA-RDD: Toward Domain Adaptive Road Damage Detection Across Different Coun- tries

Chunmian Lin, Daxin Tian, Xuting Duan, Jianshan Zhou, Dezong Zhao, and Dongpu Cao. DA-RDD: Toward Domain Adaptive Road Damage Detection Across Different Coun- tries. IEEE Transactions on Intelligent Transportation Sys- tems, 24(3):3091–3103, 2023. 2

work page 2023
[29]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning, 2023. arXiv:2304.08485 [cs]. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Computer vision for road imaging and pothole detection: a state-of-the-art review of systems and algorithms

Nachuan Ma, Jiahe Fan, Wenshuo Wang, Jin Wu, Yu Jiang, Lihua Xie, and Rui Fan. Computer vision for road imaging and pothole detection: a state-of-the-art review of systems and algorithms. Transportation Safety and Environment, 4 (4):tdac026, 2022. 2

work page 2022
[31]

OK-VQA: A Visual Question An- swering Benchmark Requiring External Knowledge, 2019

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A Visual Question An- swering Benchmark Requiring External Knowledge, 2019. arXiv:1906.00067 [cs]. 2

work page arXiv 2019
[32]

GPT-4 Technical Report

OpenAI. GPT-4 Technical Report, 2024. arXiv:2303.08774 [cs]. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

GPT-4o System Card, 2024

OpenAI. GPT-4o System Card, 2024. 2

work page 2024
[34]

Learning and Analysis of AusRAP Attributes from Digital Video Recording for Road Safety

Thihagoda Gamage Pubudu Sanjeewani and Brijesh Verma. Learning and Analysis of AusRAP Attributes from Digital Video Recording for Road Safety. In 2019 International Conference on Image and Vision Computing New Zealand (IVCNZ), pages 1–6, Dunedin, New Zealand, 2019. IEEE. 2

work page 2019
[35]

NuScenes-QA: A Multi-Modal Visual Ques- tion Answering Benchmark for Autonomous Driving Sce- nario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A Multi-Modal Visual Ques- tion Answering Benchmark for Autonomous Driving Sce- nario. Proceedings of the AAAI Conference on Artificial In- telligence, 38(5):4542–4550, 2024. 2

work page 2024
[36]

GPT4GEO: How a Language Model Sees the World’s Geography

Jonathan Roberts. GPT4GEO: How a Language Model Sees the World’s Geography. In Foundation Models for Decision Making Workshop at NeurIPS 2023., 2023. 7

work page 2023
[37]

Optimization of Fully Convolutional Network for Road Safety Attribute De- tection

Pubudu Sanjeewani and Brijesh Verma. Optimization of Fully Convolutional Network for Road Safety Attribute De- tection. IEEE Access, 9:120525–120536, 2021. 2

work page 2021
[38]

Single class detection-based deep learning approach for identification of road safety attributes

Pubudu Sanjeewani and Brijesh Verma. Single class detection-based deep learning approach for identification of road safety attributes. Neural Computing and Applications , 33(15):9691–9702, 2021. 2

work page 2021
[39]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very Deep Convo- lutional Networks for Large-Scale Image Recognition, 2015. arXiv:1409.1556 [cs]. 6

work page internal anchor Pith review Pith/arXiv arXiv 2015
[40]

FARSA: Fully Automated Roadway Safety Assess- ment

Weilian Song, Scott Workman, Armin Hadzic, Xu Zhang, Eric Green, Mei Chen, Reginald Souleyrette, and Nathan Ja- cobs. FARSA: Fully Automated Roadway Safety Assess- ment. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 521–529, Lake Tahoe, NV ,

work page 2018
[41]

Investigating the potential of crowd- sourced street-level imagery in understanding the spatiotem- poral dynamics of cities: A case study of walkability in Inner London

Meihui Wang, James Haworth, Huanfa Chen, Yunzhe Liu, and Zhengxiang Shi. Investigating the potential of crowd- sourced street-level imagery in understanding the spatiotem- poral dynamics of cities: A case study of walkability in Inner London. Cities, 153:105243, 2024. 8

work page 2024
[42]

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving, 2023

Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, and Yu Qiao. On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving, 2023. arXiv:2311.05332 [cs]. 2

work page arXiv 2023
[43]

Global status report on road safety 2023

WHO. Global status report on road safety 2023. Technical report, World Health Organization, Geneva, 2023. 1

work page 2023
[44]

A survey of efficient fine- tuning methods for Vision-Language Models — Prompt and Adapter

Jialu Xing, Jianping Liu, Jian Wang, Lulu Sun, Xi Chen, Xunxun Gu, and Yingfei Wang. A survey of efficient fine- tuning methods for Vision-Language Models — Prompt and Adapter. Computers & Graphics, 119:103885, 2024. 2

work page 2024
[45]

DriveGPT4-V2: Harnessing Large Language Model Capa- bilities for Enhanced Closed-Loop Autonomous Driving

Zhenhua Xu, Yan Bai, Yujia Zhang, Zhuoling Li, Fei Xia, Kwan-Yee K Wong, Jianqiang Wang, and Hengshuang Zhao. DriveGPT4-V2: Harnessing Large Language Model Capa- bilities for Enhanced Closed-Loop Autonomous Driving. In CVPR 2025, 2025. 2

work page 2025
[46]

Multimodal Deep Learning for Robust Road Attribute Detection

Yifang Yin, Wenmiao Hu, An Tran, Ying Zhang, Guanfeng Wang, Hannes Kruppa, Roger Zimmermann, and See-Kiong Ng. Multimodal Deep Learning for Robust Road Attribute Detection. ACM Transactions on Spatial Algorithms and Systems, 9(4):1–25, 2023. 2

work page 2023
[47]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Ren- liang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Under- standing and Reasoning Benchmark for...

[48] Zichao Zeng, June Moh Goo, Xinglei Wang, Bin Chi, Meihui Wang, and Jan Boehm. Zero-Shot Building Age Classification from Facade Image Using GPT-4. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLVIII-2-2024:457–464, 2024.

[49] Jiawei Zhang, Chejian Xu, and Bo Li. ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15459–15469, Seattle, WA, USA, 2024. IEEE.

Appendix

Table 3. Performance comparison across all iRAP-defined attributes using four models: VGG...
