pith. machine review for the scientific record.

arxiv: 2604.05210 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords construction safety · hazard detection · small vision-language models · object detection · YOLO · multimodal reasoning · safety hazard identification

The pith

Integrating object detection with small vision-language models boosts construction hazard detection performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that uses a YOLO object detector to identify workers and machinery in construction scenes and then incorporates these detections into prompts for small vision-language models. This guidance helps the models reason more effectively about potential hazards, yielding higher detection accuracy and better-quality explanations than the models produce alone. Testing six small VLMs on a dataset of annotated construction images, the authors show consistent improvements, with the best model achieving an F1-score of 50.6 percent versus 34.5 percent in the unguided case. The method adds very little processing time, making it suitable for practical safety applications on job sites where fast responses are needed.
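The paper does not ship code, but the first stage of the loop it describes is easy to sketch. Below is a minimal, illustrative detection step using the ultralytics YOLO API; the checkpoint name follows that library's convention for the YOLOv11 nano model, and the hazard-relevant class filter is an assumption, since the paper's exact label set is not reproduced on this page (a construction-specific checkpoint would define classes such as "excavator" that a generic COCO model lacks).

    # Illustrative detection step for the pipeline described above.
    # Assumptions: the ultralytics package is installed; "yolo11n.pt" is its
    # YOLOv11-nano checkpoint; RELEVANT is a guessed label set, not the paper's.
    from ultralytics import YOLO

    detector = YOLO("yolo11n.pt")
    RELEVANT = {"person", "truck", "excavator"}  # assumed worker/machinery labels

    def detect_entities(image_path: str) -> list[dict]:
        """Run the detector and keep worker/machinery boxes as plain dicts."""
        result = detector(image_path)[0]
        return [
            {
                "label": result.names[int(box.cls)],
                "conf": float(box.conf),
                "xyxy": [round(v) for v in box.xyxy[0].tolist()],
            }
            for box in result.boxes
            if result.names[int(box.cls)] in RELEVANT
        ]

These detections then feed the prompt builder sketched under "What carries the argument" below.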

Core claim

A detection-guided sVLM framework that first applies YOLOv11n to localize workers and construction machinery, then uses these localizations to build structured prompts for guiding sVLM reasoning, resulting in improved hazard identification and explanatory rationales on construction site images.

What carries the argument

Structured prompting based on YOLOv11n detections, which provides spatial context to small VLMs for more accurate multimodal hazard assessment.
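The paper states that detections are embedded into structured prompts but does not reproduce the template here, so the builder below is one plausible serialization: pixel boxes plus a coarse left/center/right cue are illustrative choices, not the authors' format.

    # One plausible serialization of detections into a structured prompt.
    # The wording and the left/center/right binning are assumptions.

    def build_prompt(detections: list[dict], image_width: int) -> str:
        lines = []
        for i, det in enumerate(detections, start=1):
            x1, y1, x2, y2 = det["xyxy"]
            cx = (x1 + x2) / 2
            third = image_width / 3
            region = "left" if cx < third else "center" if cx < 2 * third else "right"
            lines.append(
                f"{i}. {det['label']} in the {region} of the frame, "
                f"box ({x1}, {y1}, {x2}, {y2}), confidence {det['conf']:.2f}"
            )
        entities = "\n".join(lines) if lines else "(no workers or machinery detected)"
        return (
            "You are inspecting a construction site image.\n"
            "Detected entities:\n"
            f"{entities}\n"
            "Is any worker exposed to a hazard from machinery or the environment? "
            "Answer yes or no, then justify using the entities above."
        )

    # Example with dummy detections on a 640-px-wide image:
    print(build_prompt(
        [{"label": "person", "conf": 0.91, "xyxy": [40, 120, 110, 330]},
         {"label": "excavator", "conf": 0.88, "xyxy": [150, 60, 600, 380]}],
        image_width=640,
    ))

The resulting prompt, paired with the image, goes to the sVLM; the zero-shot setting means the models see this text without any fine-tuning.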

If this is right

  • Hazard detection F1-scores improve for every tested small VLM.
  • Explanation quality, measured by BERTScore, increases from 0.61 to 0.82 for the best model.
  • Inference overhead is minimal at 2.5 milliseconds per image.
  • The framework enables efficient context-aware safety monitoring in construction environments.
  • Zero-shot performance gains demonstrate the value of detection guidance without model fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could be applied to other visual safety tasks, such as identifying hazards in manufacturing or warehouse settings.
  • Future experiments might test the framework on live video streams to assess real-time performance.
  • The approach may reduce hallucinations in sVLMs by anchoring reasoning to detected objects.
  • Combining this with larger models or additional sensors could further enhance reliability in variable site conditions.

Load-bearing premise

The assumption that the YOLO detections are accurate enough and that embedding them into prompts will always improve sVLM performance without creating new errors or missing hazards outside the detected categories.

What would settle it

Running the baseline and guided versions on a new collection of construction images whose hazards sit in scenes where object detection degrades, such as occluded workers or unusual equipment, and observing no consistent improvement in F1-score or explanation quality.
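A minimal harness for that settling experiment, with the two pipelines as hypothetical stand-ins (predict_baseline and predict_guided are placeholders, not the authors' code):

    # Compare the two configurations on a stress set where detection degrades.
    # predict_baseline / predict_guided are hypothetical callables returning
    # 1 (hazard) or 0 (safe) per image.
    from sklearn.metrics import f1_score

    def compare_on_stress_set(images, labels, predict_baseline, predict_guided):
        base = [predict_baseline(img) for img in images]
        guided = [predict_guided(img) for img in images]
        return {
            "baseline_f1": f1_score(labels, base),
            "guided_f1": f1_score(labels, guided),
        }

If guided_f1 fails to beat baseline_f1 on such a set, the load-bearing premise above is in trouble.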

read the original abstract

Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes integrating YOLOv11n object detection outputs into structured prompts to guide small VLMs (<4B parameters) for zero-shot construction-site hazard identification and rationale generation. On a curated dataset of annotated construction images, the detection-guided approach yields consistent F1 gains across six sVLMs (e.g., Gemma-3 4B: 50.6% vs. 34.5% baseline) and improved explanation quality (BERTScore F1 0.82 vs. 0.61), while adding only 2.5 ms per image.

Significance. If the empirical gains prove robust, the work demonstrates a lightweight, deployable route to context-aware hazard reasoning that balances the efficiency of sVLMs against the accuracy of larger models. The negligible overhead and cross-model consistency suggest practical value for real-time safety monitoring; the prompting technique may also transfer to other grounded reasoning domains.

major comments (3)
  1. [Abstract] Abstract and evaluation section: performance claims rest on a 'curated dataset of construction site images with hazard annotations and explanatory rationales,' yet no cardinality, diversity statistics (lighting, occlusion, camera angles, hazard co-occurrence), inter-annotator agreement, or train/test split details are supplied. Without these, the representativeness assumption cannot be assessed and the reported F1/BERTScore deltas remain unverifiable.
  2. [Framework / Experiments] Framework description and experiments: the central claim that embedding YOLO detections 'reliably improve sVLM reasoning without introducing new failure modes' is untested. No ablation on imperfect detections (misses, false positives, or localization noise) is presented, leaving open whether prompt construction amplifies YOLO errors in realistic scenes.
  3. [Results] Results: the F1 (50.6% vs 34.5%) and BERTScore (0.82 vs 0.61) improvements are stated without error bars, confidence intervals, or statistical significance tests. This weakens the assertion of 'consistent' improvement across all six models.
minor comments (2)
  1. [Methods] Clarify the exact zero-shot prompt templates and how detected bounding boxes are serialized into text; an example in the methods would aid reproducibility.
  2. [Experiments] The overhead measurement (2.5 ms) should specify the hardware and whether it includes YOLO inference or only the added prompt construction step.
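One way to make that overhead number unambiguous is to time each stage separately on stated hardware. A sketch, with the stage function as a hypothetical stand-in:

    # Median per-image wall-clock time for a single pipeline stage.
    # run_stage is a hypothetical callable, e.g. YOLO inference alone or
    # prompt construction from cached detections.
    import statistics
    import time

    def time_stage_ms(run_stage, inputs, warmup: int = 5) -> float:
        for x in inputs[:warmup]:          # warm caches / JIT / GPU kernels
            run_stage(x)
        samples = []
        for x in inputs:
            t0 = time.perf_counter()
            run_stage(x)
            samples.append((time.perf_counter() - t0) * 1e3)
        return statistics.median(samples)

Reporting this per stage, together with the GPU or CPU model used, would settle whether the 2.5 ms covers YOLO inference or only prompt assembly.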

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of clarity, robustness, and statistical rigor that we will address in the revision. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: performance claims rest on a 'curated dataset of construction site images with hazard annotations and explanatory rationales,' yet no cardinality, diversity statistics (lighting, occlusion, camera angles, hazard co-occurrence), inter-annotator agreement, or train/test split details are supplied. Without these, the representativeness assumption cannot be assessed and the reported F1/BERTScore deltas remain unverifiable.

    Authors: We agree that these dataset characteristics are necessary to allow readers to evaluate representativeness and reproducibility. In the revised manuscript we will expand the Experiments section with the total number of images in the curated dataset, quantitative diversity statistics covering lighting conditions, occlusion levels, camera angles, and hazard co-occurrence frequencies, inter-annotator agreement metrics, and explicit details on the train/test split. revision: yes

  2. Referee: [Framework / Experiments] Framework description and experiments: the central claim that embedding YOLO detections 'reliably improve sVLM reasoning without introducing new failure modes' is untested. No ablation on imperfect detections (misses, false positives, or localization noise) is presented, leaving open whether prompt construction amplifies YOLO errors in realistic scenes.

    Authors: We acknowledge that the manuscript does not contain an explicit ablation on detection errors. While the observed gains were consistent across six independent sVLMs, this does not directly test error propagation. In the revision we will add a dedicated limitations paragraph discussing potential amplification of YOLO misses or false positives and will include a controlled ablation that injects realistic detection noise into the prompts to quantify sensitivity. revision: yes
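A controlled noise injection of the kind promised here could look like the sketch below; the drop probability and jitter magnitude are arbitrary illustrative parameters, not values from the paper.

    # Corrupt detections before prompt construction, then re-measure F1.
    import random

    def corrupt_detections(dets, drop_p=0.2, jitter_px=15, seed=0):
        rng = random.Random(seed)
        noisy = []
        for det in dets:
            if rng.random() < drop_p:          # simulate a missed detection
                continue
            noisy.append({
                **det,
                "xyxy": [v + rng.randint(-jitter_px, jitter_px)
                         for v in det["xyxy"]],  # simulate localization noise
            })
        return noisy

Sweeping drop_p and jitter_px while tracking guided F1 would quantify how gracefully the prompting degrades as detector quality falls.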

  3. Referee: [Results] Results: the F1 (50.6% vs 34.5%) and BERTScore (0.82 vs 0.61) improvements are stated without error bars, confidence intervals, or statistical significance tests. This weakens the assertion of 'consistent' improvement across all six models.

    Authors: We agree that the absence of variability measures and significance testing limits the strength of the consistency claim. In the revised Results section we will report standard deviations or bootstrap confidence intervals for each metric and will add paired statistical tests (e.g., McNemar or Wilcoxon signed-rank) across the six models to substantiate the reported improvements. revision: yes
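Both proposed tests are available in standard libraries; a sketch, assuming per-image correctness flags for McNemar and per-model F1 pairs for Wilcoxon (note that n = 6 models gives the signed-rank test limited power):

    # Paired significance tests named in the response.
    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_p(correct_base, correct_guided) -> float:
        """McNemar test on the 2x2 table of per-image correctness."""
        pairs = list(zip(correct_base, correct_guided))
        table = [
            [sum(b and g for b, g in pairs), sum(b and not g for b, g in pairs)],
            [sum(g and not b for b, g in pairs), sum(not b and not g for b, g in pairs)],
        ]
        return mcnemar(table, exact=True).pvalue

    def wilcoxon_p(baseline_f1s, guided_f1s) -> float:
        """Wilcoxon signed-rank across per-model F1 scores."""
        return wilcoxon(baseline_f1s, guided_f1s).pvalue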

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements on held-out data

full rationale

The paper describes a pipeline that runs YOLOv11n detection, inserts bounding-box information into fixed structured prompts, and then evaluates six sVLMs in zero-shot mode on a separately curated construction-image dataset. All reported numbers (F1 50.6% vs 34.5%, BERTScore 0.82 vs 0.61) are obtained by executing this pipeline on the test split and computing standard metrics; no parameters are fitted to the evaluation set, no quantity is defined in terms of itself, and no self-citation supplies a uniqueness theorem or ansatz that the central claim depends on. The derivation chain is therefore a standard experimental protocol whose reported metrics are computed on data that played no role in constructing the pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard pretrained models and a newly curated evaluation set; no new free parameters are fitted and no novel entities are postulated.

axioms (1)
  • domain assumption Pretrained small VLMs can leverage spatially structured prompts to improve zero-shot hazard reasoning on construction scenes
    The framework assumes the selected sVLMs possess sufficient base capability that detection-guided prompting yields the observed gains without fine-tuning.

pith-pipeline@v0.9.0 · 5636 in / 1407 out tokens · 91107 ms · 2026-05-10T19:16:14.831576+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 51 canonical work pages · 11 internal anchors

  1. [1]

    https://awcbc.org (accessed April 24, 2025)

    Association of Workers’ Compensation Boards of Canada, National Work Injury, Disease and Fatality Statistics, Association of Workers’ Compensation Boards of Canada, Ottawa, ON, 2023. https://awcbc.org (accessed April 24, 2025)

  2. [2]

    Bureau of Labor Statistics, Census of Fatal Occupational Injuries, U.S

    U.S. Bureau of Labor Statistics, Census of Fatal Occupational Injuries, U.S. Bureau of Labor Statistics, Washington, D.C., 2023. https://www.bls.gov/news.release/pdf/cfoi.pdf (accessed April 24, 2025)

  3. [3]

    X. Hou, C. Li, Q. Fang, Computer vision -based safety risk computing and visualization on construction sites, Automation in Construction 156 (2023) 105129. https://doi.org/10.1016/j.autcon.2023.105129

  4. [4]

    Zhang, J

    S. Zhang, J. Teizer, J. -K. Lee, C.M. Eastman, M. Venugopal, Building Information Modeling (BIM) and Safety: Automatic Safety Checking of Construction Models and Schedules, Automation in Construction 29 (2013) 183 –195. https://doi.org/10.1016/j.autcon.2012.05.006

  5. [5]

    Zhang, K

    S. Zhang, K. Sulankivi, M. Kiviniemi, I. Romo, C.M. Eastman, J. Teizer, BIM-based fall hazard identification and prevention in construction safety planning, Safety Science 72 (2015) 31–45. https://doi.org/10.1016/j.ssci.2014.08.001

  6. [6]

    W. Fang, L. Ding, P.E.D. Love, H. Luo, H. Li, F. Peñ a -Mora, B. Zhong, C. Zhou, Computer vision applications in construction safety assurance, Automation in Construction 110 (2020) 103013. https://doi.org/10.1016/j.autcon.2019.103013

  7. [7]

    Z.-Q. Zhao, P. Zheng, S. Xu, X. Wu, Object Detection with Deep Learning: A Review, (2019). https://doi.org/10.48550/arXiv.1807.05511

  8. [8]

    Neuhausen, J

    M. Neuhausen, J. Teizer, M. Kö nig, Construction Worker Detection and Tracking in Bird’s-Eye View Camera Images, in: Taipei, Taiwan, 2018. https://doi.org/10.22260/ISARC2018/0161

  9. [9]

    Nath, A.H

    N.D. Nath, A.H. Behzadan, S.G. Paal, Deep learning for site safety: Real -time detection of personal protective equipment, Automation in Construction 112 (2020) 103085. https://doi.org/10.1016/j.autcon.2020.103085

  10. [10]

    J. Kim, S. Chi, Action recognition of earthmoving excavators based on sequential pattern analysis of visual features and operation cycles, Automation in Construction 104 (2019) 255–264. https://doi.org/10.1016/j.autcon.2019.03.025

  11. [11]

    Sharifi, A

    A. Sharifi, A. Zibaei, M. Rezaei, A deep learning based hazardous materials (HAZMAT) sign detection robot with restricted computational resources, Machine Learning with Applications 6 (2021) 100104. https://doi.org/10.1016/j.mlwa.2021.100104

  12. [12]

    Y. Wang, B. Xiao, A. Bouferguene, M. Al -Hussein, H. Li, Vision -based method for semantic information extraction in construction by integrating deep learning object detection and image captioning, Advanced Engineering Informatics 53 (2022) 101699. https://doi.org/10.1016/j.aei.2022.101699

  13. [13]

    S. Chi, S. Han, D.Y. Kim, Relationship between Unsafe Working Conditions and Workers’ Behavior and Impact of Working Conditions on Injury Severity in U.S. Construction Industry, Journal of Construction Engineering and Management 139 (2013) 826–838. https://doi.org/10.1061/(ASCE)CO.1943-7862.0000657

  14. [14]

    Zhang, J

    L. Zhang, J. Wang, Y. Wang, H. Sun, X. Zhao, Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge, 28 Automation in Construction 142 (2022) 104535. https://doi.org/10.1016/j.autcon.2022.104535

  15. [15]

    arXiv preprint arXiv:2405.17247 , year=

    F. Bordes, R.Y. Pang, A. Ajay, A.C. Li, et al. , An Introduction to Vision -Language Modeling, (2024). https://doi.org/10.48550/arXiv.2405.17247

  16. [16]

    B. Yang, B. Zhang, Y. Han, B. Liu, J. Hu, Y. Jin, Vision transformer -based visual language understanding of the construction process, Alexandria Engineering Journal 99 (2024) 242–256. https://doi.org/10.1016/j.aej.2024.05.015

  17. [17]

    Z. Chen, H. Chen, M. Imani, R. Chen, F. Imani, Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces, Expert Systems with Applications 265 (2025) 125769. https://doi.org/10.1016/j.eswa.2024.125769

  18. [18]

    Tsai, J.J

    W.L. Tsai, J.J. Lin, S.-H. Hsieh, Generating Construction Safety Observations via CLIP- Based Image -Language Embedding, in: L. Karlinsky, T. Michaeli, K. Nishino (Eds.), Computer Vision – ECCV 2022 Workshops, Springer Nature Switzerland, Cham, 2023: pp. 366–381. https://doi.org/10.1007/978-3-031-25082-8_24

  19. [19]

    Q. Chen, X. Yin, Tailored Vision -Language Framework for Automated Hazard Identification And Report Generation in Construction Sites, (2025). https://doi.org/10.2139/ssrn.5137949

  20. [20]

    M. Adil, G. Lee, V.A. Gonzalez, Q. Mei, Using Vision Language Models for Safety Hazard Identification in Construction, (2025). https://doi.org/10.48550/arXiv.2504.09083

  21. [21]

    Chan, P.K.-Y

    J.C.F. Chan, P.K.-Y. Wong, J.C.P. Cheng, X. Guo, J.P. -C. Chan, P. -H. Leung, X. Tao, Context-Aware Vision-Language Model Agent Enriched with Domain-Specific Ontology for Construction Site Safety Monitoring, (2025). https://doi.org/10.2139/ssrn.5080079

  22. [22]

    Sharshar, L.U

    A. Sharshar, L.U. Khan, W. Ullah, M. Guizani, Vision -Language Models for Edge Networks: A Comprehensive Survey, IEEE Internet Things J. 12 (2025) 32701 –32724. https://doi.org/10.1109/JIOT.2025.3579032

  23. [23]

    Rezazadeh Azar, B

    E. Rezazadeh Azar, B. McCabe, Automated Visual Recognition of Dump Trucks in Construction Videos, Journal of Computing in Civil Engineering 26 (2012) 769 –781. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000179

  24. [24]

    Tajeen, Z

    H. Tajeen, Z. Zhu, Image dataset development for measuring construction equipment recognition performance, Automation in Construction 48 (2014) 1 –10. https://doi.org/10.1016/j.autcon.2014.07.006

  25. [25]

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. -Y. Fu, A.C. Berg, SSD: Single Shot MultiBox Detector, in: 2016: pp. 21 –37. https://doi.org/10.1007/978-3-319-46448- 0_2

  26. [26]

    You Only Look Once: Unified, Real-Time Object Detection

    J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real - Time Object Detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: pp. 779–788. https://doi.org/10.1109/CVPR.2016.91

  27. [27]

    S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 1137–

  28. [28]

    https://doi.org/10.1109/TPAMI.2016.2577031

  29. [29]

    J. Wu, N. Cai, W. Chen, H. Wang, G. Wang, Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset, Automation in Construction 106 (2019) 102894. https://doi.org/10.1016/j.autcon.2019.102894

  30. [30]

    D. Kim, M. Liu, S. Lee, V.R. Kamat, Remote proximity monitoring between mobile construction resources using camera -mounted UAVs, Automation in Construction 99 (2019) 168–182. https://doi.org/10.1016/j.autcon.2018.12.014

  31. [31]

    H. Son, C. Kim, Integrated worker detection and tracking for the safe operation of construction machinery, Automation in Construction 126 (2021) 103670. https://doi.org/10.1016/j.autcon.2021.103670. 29

  32. [32]

    Zhang, D

    J. Zhang, D. Zhang, X. Liu, R. Liu, G. Zhong, A Framework of on -site Construction Safety Management Using Computer Vision and Real -Time Location System, in: International Conference on Smart Infrastructure and Construction 2019 (ICSIC), ICE Publishing, 2019: pp. 327–333. https://doi.org/10.1680/icsic.64669.327

  33. [33]

    Chian, W

    E. Chian, W. Fang, Y.M. Goh, J. Tian, Computer vision approaches for detecting missing barricades, Automation in Construction 131 (2021) 103862. https://doi.org/10.1016/j.autcon.2021.103862

  34. [34]

    Z. Xu, J. Huang, K. Huang, A novel computer vision -based approach for monitoring safety harness use in construction, IET Image Processing 17 (2023) 1071 –1085. https://doi.org/10.1049/ipr2.12696

  35. [35]

    Ann, K.Y

    H. Ann, K.Y. Koo, Deep Learning Based Fire Risk Detection on Construction Sites, Sensors 23 (2023) 9095. https://doi.org/10.3390/s23229095

  36. [36]

    https://www.ibm.com/think/topics/vision-language-models (accessed March 8, 2026)

    What Are Vision Language Models (VLMs)? | IBM, (2025). https://www.ibm.com/think/topics/vision-language-models (accessed March 8, 2026)

  37. [37]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, (2021). https://doi.org/10.48550/arXiv.2103.00020

  38. [38]

    Flamingo: a Visual Language Model for Few-Shot Learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M . Binkowski, R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a Visual Language Model fo...

  39. [39]

    J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping Language -Image Pre -training for Unified Vision -Language Understanding and Generation, (2022). https://doi.org/10.48550/arXiv.2201.12086

  40. [40]

    H. Liu, C. Li, Y. Li, Y.J. Lee, Improved Baselines with Visual Instruction Tuning, (2024). https://doi.org/10.48550/arXiv.2310.03744

  41. [41]

    GPT-4 Technical Report

    OpenAI, S. Adler, S. Agarwal, et al. , GPT -4 Technical Report, (2024). http://arxiv.org/abs/2303.08774 (accessed December 19, 2024)

  42. [42]

    G. Team, P. Georgiev, et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, (2024). https://doi.org/10.48550/arXiv.2403.05530

  43. [43]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, et al. , The Llama 3 Herd of Models, (2024). https://doi.org/10.48550/arXiv.2407.21783

  44. [44]

    A. Yang, A. Li, et al. , Qwen3 Technical Report, (2025). https://doi.org/10.48550/arXiv.2505.09388

  45. [45]

    J. Zhu, W. Wang, et al., InternVL3: Exploring Advanced Training and Test-Time Recipes for Open -Source Multimodal Models, (2025). https://doi.org/10.48550/arXiv.2504.10479

  46. [46]

    Radford, J.W

    A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, in: Proceedings of the 38th International Confere nce on Machine Learning, PMLR, 2021: pp. 8748 –8763. https://proceedings.mlr.press/v139/radford21a.ht...

  47. [47]

    Y. Ding, M. Liu, X. Luo, Safety compliance checking of construction behaviors using visual question answering, Automation in Construction 144 (2022) 104580. https://doi.org/10.1016/j.autcon.2022.104580

  48. [48]

    C. Fan, Q. Mei, X. Wang, X. Li, ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers, (2024). https://doi.org/10.48550/arXiv.2412.19954. 30

  49. [50]

    Li, B.C.M

    M.Q. Li, B.C.M. Fung, Security concerns for Large Language Models: A survey, Journal of Information Security and Applications 95 (2025) 104284. https://doi.org/10.1016/j.jisa.2025.104284

  50. [51]

    A. Rahman, A systematic review of vision language models: Comprehensive analysis of architectures, applications, datasets and challenges towards robust multimodal intelligence, Array (2026) 100739. https://doi.org/10.1016/j.array.2026.100739

  51. [52]

    Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, and 1 others

    G. Shinde, A. Ravi, E. Dey, S. Sakib, M. Rampure, N. Roy, A Survey on Efficient Vision- Language Models, (2025). https://doi.org/10.48550/arXiv.2504.09724

  52. [53]

    G. Team, A. Kamath, et al. Gemma 3 Technical Report, (2025). https://doi.org/10.48550/arXiv.2503.19786

  53. [54]

    SmolVLM: Redefining small and efficient multimodal models

    A. Marafioti, O. Zohar, M. Farré , M. Noyan, E. Bakouch, P. Cuenca, C. Zakka, L.B. Allal, A. Lozhkov, N. Tazi, V. Srivastav, J. Lochner, H. Larcher, M. Morlon, L. Tunstall, L. von Werra, T. Wolf, SmolVLM: Redefining small and efficient multimodal models, (2025). https://doi.org/10.48550/arXiv.2504.05299

  54. [55]

    Patnaik, N

    N. Patnaik, N. Nayak, H.B. Agrawal, M.C. Khamaru, G. Bal, S.S. Panda, R. Raj, V. Meena, K. Vadlamani, Small Vision -Language Models: A Survey on Compact Architectures and Techniques, (2025). https://doi.org/10.48550/arXiv.2503.10665

  55. [56]

    https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm } }, url = { https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm (accessed December 9, 2024)

    Roboflow, Safety Inspection Dataset, (n.d.). https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm } }, url = { https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm (accessed December 9, 2024)

  56. [57]

    Xiao, S.-C

    B. Xiao, S.-C. Kang, Development of an Image Data Set of Construction Machines for Deep Learning Object Detection, Journal of Computing in Civil Engineering 35 (2021) 05020005. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000945

  57. [58]

    X. Chen, Z. Zou, Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?, (2025). https://doi.org/10.48550/arXiv.2508.11011