V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?
Pith reviewed 2026-05-23 22:03 UTC · model grok-4.3
The pith
Vision-language models can classify iRAP road safety attributes from single street images without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-language models can perform zero-shot visual question answering to classify road safety attributes according to the iRAP standard, generalizing to new classes without retraining while underperforming on spatial tasks, as shown on the new ThaiRAP image dataset.
What carries the argument
The V-RoAst zero-shot VQA framework that converts iRAP attribute definitions into natural-language questions posed to VLMs on individual street-level images.
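As a rough illustration of how an attribute definition might become a question for a VLM, the following minimal Python sketch builds a closed-set prompt and parses the reply. The `Attribute` class, the helper names, and the example attribute and its class labels are all hypothetical; they are not taken from the paper's code or from the iRAP coding manual, and the actual V-RoAst prompts may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Attribute:
    """One road attribute with its allowed class labels (illustrative only)."""
    name: str
    definition: str
    classes: Tuple[str, ...]


def build_question(attr: Attribute) -> str:
    """Turn an attribute definition into a closed-set question for a VLM."""
    options = "\n".join(
        f"{i}. {label}" for i, label in enumerate(attr.classes, start=1)
    )
    return (
        f"Attribute: {attr.name}\n"
        f"Definition: {attr.definition}\n"
        "Looking only at the attached street-level image, which class applies?\n"
        f"{options}\n"
        "Answer with the option number only."
    )


def parse_answer(attr: Attribute, reply: str) -> Optional[str]:
    """Map a raw model reply to a class label; None if it cannot be parsed."""
    parts = reply.strip().split()
    token = parts[0].rstrip(".") if parts else ""
    if token.isdecimal() and 1 <= int(token) <= len(attr.classes):
        return attr.classes[int(token) - 1]
    return None


# Hypothetical attribute, NOT copied from the iRAP coding manual.
crossing = Attribute(
    name="Pedestrian crossing",
    definition="Whether a marked or signalised crossing is visible in the scene.",
    classes=("No crossing visible", "Marked crossing", "Signalised crossing"),
)
prompt = build_question(crossing)
```

Under this sketch, adapting to a new region or an updated criterion would amount to editing the `Attribute` entries rather than retraining a model, which is the flexibility the review attributes to the prompt-based approach.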
If this is right
- Road safety ratings can be generated automatically for previously unassessed roads without collecting new labeled training data.
- Prompt changes allow the same model to handle new regions or updated iRAP criteria without retraining.
- Integration with complementary data sources can compensate for the models' spatial weaknesses.
- Low-cost mapping of infrastructure risks becomes feasible in areas that lack expert assessors.
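The paper does not describe an integration mechanism, so the following is only one possible reading of "integration with complementary data": take spatial attributes from an external source when one is available, and fall back to the VLM otherwise. The attribute names, the choice of which attributes count as spatial, and the `fuse` function are assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical split: which attributes are treated as spatial is an assumption.
SPATIAL_ATTRIBUTES = {"lane_width", "curvature"}


def fuse(
    vlm: Dict[str, str], external: Dict[str, Optional[str]]
) -> Dict[str, Dict[str, str]]:
    """Prefer external data for spatial attributes; otherwise keep the VLM label."""
    fused = {}
    for name, vlm_label in vlm.items():
        ext_label = external.get(name)
        if name in SPATIAL_ATTRIBUTES and ext_label is not None:
            fused[name] = {"label": ext_label, "source": "external"}
        else:
            fused[name] = {"label": vlm_label, "source": "vlm"}
    return fused


# Made-up example values, not results from the paper.
vlm_labels = {"lane_width": "narrow", "curvature": "straight", "guardrail": "present"}
map_labels = {"lane_width": "medium", "curvature": None}
result = fuse(vlm_labels, map_labels)
```

Recording the `source` alongside each label keeps it possible to evaluate afterwards whether the external data actually corrected spatial errors, which is the evidence the referee report asks for.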
Where Pith is reading between the lines
- Pairing single-image VLM outputs with map layers or multi-view captures could address the documented spatial shortfalls.
- The method might extend to other infrastructure rating systems beyond iRAP by swapping the question set.
- Selective human review triggered only on low-confidence VLM responses could balance cost and reliability in practice.
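A minimal sketch of how selective human review might be triggered is given here. It assumes that confidence is estimated as the agreement rate across repeated queries of the same image and question; this is an assumption for illustration, since neither the review nor the abstract specifies how confidence would be obtained. The function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def vote(replies: List[str]) -> Tuple[str, float]:
    """Return the majority label and the fraction of replies that agree with it."""
    if not replies:
        raise ValueError("need at least one reply")
    label, count = Counter(replies).most_common(1)[0]
    return label, count / len(replies)


def needs_review(replies: List[str], threshold: float = 0.8) -> bool:
    """Flag an image for human review when agreement falls below the threshold."""
    _, agreement = vote(replies)
    return agreement < threshold


# Made-up replies from three repeated queries on one image.
label, agreement = vote(["present", "present", "absent"])
```

The threshold would need to be calibrated against expert labels; a value chosen without such a check gives no guarantee about the error rate of the unreviewed answers.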
Load-bearing premise
A single street-level image holds enough visual detail for reliable iRAP attribute classification, and VLM answers can be used without extra human checks or additional sensors.
What would settle it
A set of expert-verified images where the VLM consistently fails to detect the presence or absence of a specific iRAP feature such as guardrails or pedestrian crossings.
Original abstract
Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce V-RoAst, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces V-RoAst, a zero-shot VQA framework using VLMs (Gemini-1.5-flash and GPT-4o-mini) to classify iRAP road safety attributes from street-level images. It releases the first open-source ThaiRAP dataset (>2,000 curated images from Thailand) and benchmarks the VLMs against VGGNet and ResNet baselines. The authors observe that VLMs underperform on spatial awareness yet generalize to unseen classes, and conclude that VLMs can serve as automatic road assessment tools when integrated with complementary data; the work is positioned as the first exploration of VLMs for zero-shot infrastructure risk assessment.
Significance. If supported by quantitative evidence, the work could be significant for enabling low-cost, scalable road safety mapping in LMICs where most roads remain unrated. The release of the ThaiRAP dataset provides a reusable resource, and the zero-shot prompt-based approach avoids the region-specific retraining required by supervised baselines. The qualified claim (integration with complementary data) could open practical directions if the integration mechanism and error compensation are demonstrated.
major comments (3)
- [Abstract] Abstract: performance claims are stated only qualitatively (VLMs 'underperform on spatial awareness' yet 'generalise well to unseen classes') with no per-attribute accuracy, precision/recall, confusion matrices, or numeric comparison to the VGGNet/ResNet baselines. Without these metrics it is impossible to assess whether the observed generalization is sufficient for the central claim that VLMs can serve as road assessment tools.
- [Abstract] Abstract (final paragraph): the headline result that 'VLMs can serve as automatic road assessment tools when integrated with complementary data' is not accompanied by any description of the integration mechanism, any quantitative evidence that complementary data compensates for spatial errors, or any evaluation of the combined system. Many iRAP attributes (curvature, lane width, roadside hazards, sight distance) are inherently spatial, so the noted spatial-awareness failures remain load-bearing.
- [Abstract] Abstract: the assumption that single street-level images contain sufficient visual information for reliable iRAP attribute classification is stated without supporting evidence or discussion of failure modes; the paper does not address how VLM responses would be trusted for safety-critical decisions without human verification or additional sensors.
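The metrics named in the first comment can be computed for any single attribute from paired expert and model labels. The sketch below is a plain-Python illustration with made-up labels; the function names are hypothetical and the numbers are not results from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion(y_true: List[str], y_pred: List[str]) -> Counter:
    """Count (true, predicted) label pairs for one attribute."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    return Counter(zip(y_true, y_pred))


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of images whose predicted label matches the expert label."""
    if not y_true:
        raise ValueError("empty input")
    cm = confusion(y_true, y_pred)
    return sum(n for (t, p), n in cm.items() if t == p) / len(y_true)


def per_class_scores(
    y_true: List[str], y_pred: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {class: (precision, recall)}; a zero denominator yields 0.0."""
    cm = confusion(y_true, y_pred)
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = cm[(c, c)]
        predicted = sum(n for (_, p), n in cm.items() if p == c)
        actual = sum(n for (t, _), n in cm.items() if t == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[c] = (precision, recall)
    return scores


# Made-up labels for a hypothetical "guardrail" attribute.
expert = ["present", "present", "absent", "absent"]
model = ["present", "absent", "absent", "absent"]
acc = accuracy(expert, model)
scores = per_class_scores(expert, model)
```

Reporting precision and recall per class matters here because iRAP attributes are often imbalanced: a model that always answers "absent" for a rare feature can score high accuracy while missing every instance that a safety assessment depends on.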
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas where the abstract can be strengthened with quantitative details and clearer qualifications. We address each major comment below and will incorporate revisions to improve the manuscript.
- Referee: [Abstract] Abstract: performance claims are stated only qualitatively (VLMs 'underperform on spatial awareness' yet 'generalise well to unseen classes') with no per-attribute accuracy, precision/recall, confusion matrices, or numeric comparison to the VGGNet/ResNet baselines. Without these metrics it is impossible to assess whether the observed generalization is sufficient for the central claim that VLMs can serve as road assessment tools.
  Authors: We agree that the abstract should include quantitative metrics to support the claims. The full manuscript (Section 4 and supplementary material) reports per-attribute accuracies, precision, recall, and direct numeric comparisons to VGGNet and ResNet baselines, including confusion matrices for key attributes. We will revise the abstract to incorporate key numeric results (e.g., overall accuracy figures and generalization gaps) while retaining the qualitative summary of spatial vs. generalization performance.
  Revision: yes
- Referee: [Abstract] Abstract (final paragraph): the headline result that 'VLMs can serve as automatic road assessment tools when integrated with complementary data' is not accompanied by any description of the integration mechanism, any quantitative evidence that complementary data compensates for spatial errors, or any evaluation of the combined system. Many iRAP attributes (curvature, lane width, roadside hazards, sight distance) are inherently spatial, so the noted spatial-awareness failures remain load-bearing.
  Authors: The manuscript presents the integration statement as a forward-looking qualified claim rather than a result demonstrated in this work, which focuses on zero-shot VLM evaluation. We acknowledge the abstract phrasing implies more support than is provided. We will revise the final paragraph to explicitly state that demonstrating integration mechanisms and error compensation with complementary data (e.g., for spatial attributes) is proposed as future work, removing any implication of current evidence.
  Revision: yes
- Referee: [Abstract] Abstract: the assumption that single street-level images contain sufficient visual information for reliable iRAP attribute classification is stated without supporting evidence or discussion of failure modes; the paper does not address how VLM responses would be trusted for safety-critical decisions without human verification or additional sensors.
  Authors: The approach follows the iRAP standard's reliance on street-level imagery, but we agree the abstract lacks explicit discussion of limitations. We will add a limitations section to the revised manuscript that covers failure modes (including spatial reasoning), notes that single-image inputs may be insufficient for certain attributes, and emphasizes that VLM outputs are intended as assistive tools requiring human oversight and potential multi-sensor validation for safety-critical use.
  Revision: yes
Circularity Check
Empirical evaluation study with no derivation chain or self-referential claims
The paper introduces a new dataset (ThaiRAP) and performs direct empirical evaluation of VLMs (Gemini-1.5-flash, GPT-4o-mini) against CNN baselines (VGGNet, ResNet) for zero-shot iRAP attribute classification. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the central claims. Results are reported from explicit experiments on the held-out test set; the qualified conclusion that VLMs 'can serve as automatic road assessment tools when integrated with complementary data' rests on observed metrics rather than any definitional or self-citation reduction. This is a standard self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: VLMs can perform zero-shot visual question answering on road scene images without domain-specific fine-tuning.
- Domain assumption: iRAP safety attributes are visually discernible from single street-level photographs.
Forward citations
Cited by 1 Pith paper
- CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images
  A lightweight multi-modal CLIP pipeline predicts exact-match geographical tags on a Kaggle subset of the Geograph crowdsourced image archive by fusing image, location, and title embeddings.