pith. machine review for the scientific record.

arxiv: 2604.05210 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords construction safety · hazard detection · small vision-language models · object detection · YOLO · multimodal reasoning · safety hazard identification

The pith

Integrating object detection with small vision-language models boosts construction hazard detection performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that uses a YOLO object detector to identify workers and machinery in construction scenes and then incorporates these detections into prompts for small vision-language models. This guidance helps the models reason more effectively about potential hazards, yielding higher detection accuracy and better-quality explanations than the models produce alone. Testing six small VLMs on a dataset of annotated construction images, the authors show consistent improvements, with the best model achieving an F1-score of 50.6 percent versus 34.5 percent in the unguided case. The method adds very little processing time, making it suitable for practical safety applications on job sites where fast responses are needed.
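The paper does not ship code, but the first stage of the loop it describes is easy to sketch. Below is a minimal, illustrative detection step using the ultralytics YOLO API; the checkpoint name follows that library's convention for the YOLOv11 nano model, and the hazard-relevant class filter is an assumption, since the paper's exact label set is not reproduced on this page (a construction-specific checkpoint would define classes such as "excavator" that a generic COCO model lacks).

    # Illustrative detection step for the pipeline described above.
    # Assumptions: the ultralytics package is installed; "yolo11n.pt" is its
    # YOLOv11-nano checkpoint; RELEVANT is a guessed label set, not the paper's.
    from ultralytics import YOLO

    detector = YOLO("yolo11n.pt")
    RELEVANT = {"person", "truck", "excavator"}  # assumed worker/machinery labels

    def detect_entities(image_path: str) -> list[dict]:
        """Run the detector and keep worker/machinery boxes as plain dicts."""
        result = detector(image_path)[0]
        return [
            {
                "label": result.names[int(box.cls)],
                "conf": float(box.conf),
                "xyxy": [round(v) for v in box.xyxy[0].tolist()],
            }
            for box in result.boxes
            if result.names[int(box.cls)] in RELEVANT
        ]

These detections then feed the prompt builder sketched under "What carries the argument" below.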

Core claim

A detection-guided sVLM framework that first applies YOLOv11n to localize workers and construction machinery, then uses these localizations to build structured prompts for guiding sVLM reasoning, resulting in improved hazard identification and explanatory rationales on construction site images.

What carries the argument

Structured prompting based on YOLOv11n detections, which provides spatial context to small VLMs for more accurate multimodal hazard assessment.
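The paper states that detections are embedded into structured prompts but does not reproduce the template here, so the builder below is one plausible serialization: pixel boxes plus a coarse left/center/right cue are illustrative choices, not the authors' format.

    # One plausible serialization of detections into a structured prompt.
    # The wording and the left/center/right binning are assumptions.

    def build_prompt(detections: list[dict], image_width: int) -> str:
        lines = []
        for i, det in enumerate(detections, start=1):
            x1, y1, x2, y2 = det["xyxy"]
            cx = (x1 + x2) / 2
            third = image_width / 3
            region = "left" if cx < third else "center" if cx < 2 * third else "right"
            lines.append(
                f"{i}. {det['label']} in the {region} of the frame, "
                f"box ({x1}, {y1}, {x2}, {y2}), confidence {det['conf']:.2f}"
            )
        entities = "\n".join(lines) if lines else "(no workers or machinery detected)"
        return (
            "You are inspecting a construction site image.\n"
            "Detected entities:\n"
            f"{entities}\n"
            "Is any worker exposed to a hazard from machinery or the environment? "
            "Answer yes or no, then justify using the entities above."
        )

    # Example with dummy detections on a 640-px-wide image:
    print(build_prompt(
        [{"label": "person", "conf": 0.91, "xyxy": [40, 120, 110, 330]},
         {"label": "excavator", "conf": 0.88, "xyxy": [150, 60, 600, 380]}],
        image_width=640,
    ))

The resulting prompt, paired with the image, goes to the sVLM; the zero-shot setting means the models see this text without any fine-tuning.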

If this is right

  • Hazard detection F1-scores improve for every tested small VLM.
  • Explanation quality, measured by BERTScore, increases from 0.61 to 0.82 for the best model.
  • Inference overhead is minimal at 2.5 milliseconds per image.
  • The framework enables efficient context-aware safety monitoring in construction environments.
  • Zero-shot performance gains demonstrate the value of detection guidance without model fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could be applied to other visual safety tasks, such as identifying hazards in manufacturing or warehouse settings.
  • Future experiments might test the framework on live video streams to assess real-time performance.
  • The approach may reduce hallucinations in sVLMs by anchoring reasoning to detected objects.
  • Combining this with larger models or additional sensors could further enhance reliability in variable site conditions.

Load-bearing premise

The assumption that the YOLO detections are accurate enough and that embedding them into prompts will always improve sVLM performance without creating new errors or missing hazards outside the detected categories.

What would settle it

Running the baseline and guided versions on a new collection of construction images whose hazards sit in scenes where object detection degrades, such as occluded workers or unusual equipment, and observing no consistent improvement in F1-score or explanation quality.
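A minimal harness for that settling experiment, with the two pipelines as hypothetical stand-ins (predict_baseline and predict_guided are placeholders, not the authors' code):

    # Compare the two configurations on a stress set where detection degrades.
    # predict_baseline / predict_guided are hypothetical callables returning
    # 1 (hazard) or 0 (safe) per image.
    from sklearn.metrics import f1_score

    def compare_on_stress_set(images, labels, predict_baseline, predict_guided):
        base = [predict_baseline(img) for img in images]
        guided = [predict_guided(img) for img in images]
        return {
            "baseline_f1": f1_score(labels, base),
            "guided_f1": f1_score(labels, guided),
        }

If guided_f1 fails to beat baseline_f1 on such a set, the load-bearing premise above is in trouble.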

read the original abstract

Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes integrating YOLOv11n object detection outputs into structured prompts to guide small VLMs (<4B parameters) for zero-shot construction-site hazard identification and rationale generation. On a curated dataset of annotated construction images, the detection-guided approach yields consistent F1 gains across six sVLMs (e.g., Gemma-3 4B: 50.6% vs. 34.5% baseline) and improved explanation quality (BERTScore F1 0.82 vs. 0.61), while adding only 2.5 ms per image.

Significance. If the empirical gains prove robust, the work demonstrates a lightweight, deployable route to context-aware hazard reasoning that balances the efficiency of sVLMs against the accuracy of larger models. The negligible overhead and cross-model consistency suggest practical value for real-time safety monitoring; the prompting technique may also transfer to other grounded reasoning domains.

major comments (3)
  1. [Abstract] Abstract and evaluation section: performance claims rest on a 'curated dataset of construction site images with hazard annotations and explanatory rationales,' yet no cardinality, diversity statistics (lighting, occlusion, camera angles, hazard co-occurrence), inter-annotator agreement, or train/test split details are supplied. Without these, the representativeness assumption cannot be assessed and the reported F1/BERTScore deltas remain unverifiable.
  2. [Framework / Experiments] Framework description and experiments: the central claim that embedding YOLO detections 'reliably improve sVLM reasoning without introducing new failure modes' is untested. No ablation on imperfect detections (misses, false positives, or localization noise) is presented, leaving open whether prompt construction amplifies YOLO errors in realistic scenes.
  3. [Results] Results: the F1 (50.6% vs 34.5%) and BERTScore (0.82 vs 0.61) improvements are stated without error bars, confidence intervals, or statistical significance tests. This weakens the assertion of 'consistent' improvement across all six models.
minor comments (2)
  1. [Methods] Clarify the exact zero-shot prompt templates and how detected bounding boxes are serialized into text; an example in the methods would aid reproducibility.
  2. [Experiments] The overhead measurement (2.5 ms) should specify the hardware and whether it includes YOLO inference or only the added prompt construction step.
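One way to make that overhead number unambiguous is to time each stage separately on stated hardware. A sketch, with the stage function as a hypothetical stand-in:

    # Median per-image wall-clock time for a single pipeline stage.
    # run_stage is a hypothetical callable, e.g. YOLO inference alone or
    # prompt construction from cached detections.
    import statistics
    import time

    def time_stage_ms(run_stage, inputs, warmup: int = 5) -> float:
        for x in inputs[:warmup]:          # warm caches / JIT / GPU kernels
            run_stage(x)
        samples = []
        for x in inputs:
            t0 = time.perf_counter()
            run_stage(x)
            samples.append((time.perf_counter() - t0) * 1e3)
        return statistics.median(samples)

Reporting this per stage, together with the GPU or CPU model used, would settle whether the 2.5 ms covers YOLO inference or only prompt assembly.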

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of clarity, robustness, and statistical rigor that we will address in the revision. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: performance claims rest on a 'curated dataset of construction site images with hazard annotations and explanatory rationales,' yet no cardinality, diversity statistics (lighting, occlusion, camera angles, hazard co-occurrence), inter-annotator agreement, or train/test split details are supplied. Without these, the representativeness assumption cannot be assessed and the reported F1/BERTScore deltas remain unverifiable.

    Authors: We agree that these dataset characteristics are necessary to allow readers to evaluate representativeness and reproducibility. In the revised manuscript we will expand the Experiments section with the total number of images in the curated dataset, quantitative diversity statistics covering lighting conditions, occlusion levels, camera angles, and hazard co-occurrence frequencies, inter-annotator agreement metrics, and explicit details on the train/test split. revision: yes

  2. Referee: [Framework / Experiments] Framework description and experiments: the central claim that embedding YOLO detections 'reliably improve sVLM reasoning without introducing new failure modes' is untested. No ablation on imperfect detections (misses, false positives, or localization noise) is presented, leaving open whether prompt construction amplifies YOLO errors in realistic scenes.

    Authors: We acknowledge that the manuscript does not contain an explicit ablation on detection errors. While the observed gains were consistent across six independent sVLMs, this does not directly test error propagation. In the revision we will add a dedicated limitations paragraph discussing potential amplification of YOLO misses or false positives and will include a controlled ablation that injects realistic detection noise into the prompts to quantify sensitivity. revision: yes
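A controlled noise injection of the kind promised here could look like the sketch below; the drop probability and jitter magnitude are arbitrary illustrative parameters, not values from the paper.

    # Corrupt detections before prompt construction, then re-measure F1.
    import random

    def corrupt_detections(dets, drop_p=0.2, jitter_px=15, seed=0):
        rng = random.Random(seed)
        noisy = []
        for det in dets:
            if rng.random() < drop_p:          # simulate a missed detection
                continue
            noisy.append({
                **det,
                "xyxy": [v + rng.randint(-jitter_px, jitter_px)
                         for v in det["xyxy"]],  # simulate localization noise
            })
        return noisy

Sweeping drop_p and jitter_px while tracking guided F1 would quantify how gracefully the prompting degrades as detector quality falls.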

  3. Referee: [Results] Results: the F1 (50.6% vs 34.5%) and BERTScore (0.82 vs 0.61) improvements are stated without error bars, confidence intervals, or statistical significance tests. This weakens the assertion of 'consistent' improvement across all six models.

    Authors: We agree that the absence of variability measures and significance testing limits the strength of the consistency claim. In the revised Results section we will report standard deviations or bootstrap confidence intervals for each metric and will add paired statistical tests (e.g., McNemar or Wilcoxon signed-rank) across the six models to substantiate the reported improvements. revision: yes
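Both proposed tests are available in standard libraries; a sketch, assuming per-image correctness flags for McNemar and per-model F1 pairs for Wilcoxon (note that n = 6 models gives the signed-rank test limited power):

    # Paired significance tests named in the response.
    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_p(correct_base, correct_guided) -> float:
        """McNemar test on the 2x2 table of per-image correctness."""
        pairs = list(zip(correct_base, correct_guided))
        table = [
            [sum(b and g for b, g in pairs), sum(b and not g for b, g in pairs)],
            [sum(g and not b for b, g in pairs), sum(not b and not g for b, g in pairs)],
        ]
        return mcnemar(table, exact=True).pvalue

    def wilcoxon_p(baseline_f1s, guided_f1s) -> float:
        """Wilcoxon signed-rank across per-model F1 scores."""
        return wilcoxon(baseline_f1s, guided_f1s).pvalue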

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements on held-out data

full rationale

The paper describes a pipeline that runs YOLOv11n detection, inserts bounding-box information into fixed structured prompts, and then evaluates six sVLMs in zero-shot mode on a separately curated construction-image dataset. All reported numbers (F1 50.6% vs 34.5%, BERTScore 0.82 vs 0.61) are obtained by executing this pipeline on the test split and computing standard metrics; no parameters are fitted to the evaluation set, no quantity is defined in terms of itself, and no self-citation supplies a uniqueness theorem or ansatz that the central claim depends on. The derivation chain is therefore a standard experimental protocol whose reported metrics are computed on data that played no role in constructing the pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard pretrained models and a newly curated evaluation set; no new free parameters are fitted and no novel entities are postulated.

axioms (1)
  • domain assumption Pretrained small VLMs can leverage spatially structured prompts to improve zero-shot hazard reasoning on construction scenes
    The framework assumes the selected sVLMs possess sufficient base capability that detection-guided prompting yields the observed gains without fine-tuning.

pith-pipeline@v0.9.0 · 5636 in / 1407 out tokens · 91107 ms · 2026-05-10T19:16:14.831576+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 51 canonical work pages · 11 internal anchors

  1. [1]

    https://awcbc.org (accessed April 24, 2025)

    Association of Workers’ Compensation Boards of Canada, National Work Injury, Disease and Fatality Statistics, Association of Workers’ Compensation Boards of Canada, Ottawa, ON, 2023. https://awcbc.org (accessed April 24, 2025)

  2. [2]

    Bureau of Labor Statistics, Census of Fatal Occupational Injuries, U.S

    U.S. Bureau of Labor Statistics, Census of Fatal Occupational Injuries, U.S. Bureau of Labor Statistics, Washington, D.C., 2023. https://www.bls.gov/news.release/pdf/cfoi.pdf (accessed April 24, 2025)

  3. [3]

    X. Hou, C. Li, Q. Fang, Computer vision -based safety risk computing and visualization on construction sites, Automation in Construction 156 (2023) 105129. https://doi.org/10.1016/j.autcon.2023.105129

  4. [4]

    Zhang, J

    S. Zhang, J. Teizer, J. -K. Lee, C.M. Eastman, M. Venugopal, Building Information Modeling (BIM) and Safety: Automatic Safety Checking of Construction Models and Schedules, Automation in Construction 29 (2013) 183 –195. https://doi.org/10.1016/j.autcon.2012.05.006

  5. [5]

    Zhang, K

    S. Zhang, K. Sulankivi, M. Kiviniemi, I. Romo, C.M. Eastman, J. Teizer, BIM-based fall hazard identification and prevention in construction safety planning, Safety Science 72 (2015) 31–45. https://doi.org/10.1016/j.ssci.2014.08.001

  6. [6]

    W. Fang, L. Ding, P.E.D. Love, H. Luo, H. Li, F. Peñ a -Mora, B. Zhong, C. Zhou, Computer vision applications in construction safety assurance, Automation in Construction 110 (2020) 103013. https://doi.org/10.1016/j.autcon.2019.103013

  7. [7]

    Z.-Q. Zhao, P. Zheng, S. Xu, X. Wu, Object Detection with Deep Learning: A Review, (2019). https://doi.org/10.48550/arXiv.1807.05511

  8. [8]

    Neuhausen, J

    M. Neuhausen, J. Teizer, M. Kö nig, Construction Worker Detection and Tracking in Bird’s-Eye View Camera Images, in: Taipei, Taiwan, 2018. https://doi.org/10.22260/ISARC2018/0161

  9. [9]

    Nath, A.H

    N.D. Nath, A.H. Behzadan, S.G. Paal, Deep learning for site safety: Real -time detection of personal protective equipment, Automation in Construction 112 (2020) 103085. https://doi.org/10.1016/j.autcon.2020.103085

  10. [10]

    J. Kim, S. Chi, Action recognition of earthmoving excavators based on sequential pattern analysis of visual features and operation cycles, Automation in Construction 104 (2019) 255–264. https://doi.org/10.1016/j.autcon.2019.03.025

  11. [11]

    Sharifi, A

    A. Sharifi, A. Zibaei, M. Rezaei, A deep learning based hazardous materials (HAZMAT) sign detection robot with restricted computational resources, Machine Learning with Applications 6 (2021) 100104. https://doi.org/10.1016/j.mlwa.2021.100104

  12. [12]

    Y. Wang, B. Xiao, A. Bouferguene, M. Al -Hussein, H. Li, Vision -based method for semantic information extraction in construction by integrating deep learning object detection and image captioning, Advanced Engineering Informatics 53 (2022) 101699. https://doi.org/10.1016/j.aei.2022.101699

  13. [13]

    S. Chi, S. Han, D.Y. Kim, Relationship between Unsafe Working Conditions and Workers’ Behavior and Impact of Working Conditions on Injury Severity in U.S. Construction Industry, Journal of Construction Engineering and Management 139 (2013) 826–838. https://doi.org/10.1061/(ASCE)CO.1943-7862.0000657

  14. [14]

    Zhang, J

    L. Zhang, J. Wang, Y. Wang, H. Sun, X. Zhao, Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge, 28 Automation in Construction 142 (2022) 104535. https://doi.org/10.1016/j.autcon.2022.104535

  15. [15]

    arXiv preprint arXiv:2405.17247 , year=

    F. Bordes, R.Y. Pang, A. Ajay, A.C. Li, et al. , An Introduction to Vision -Language Modeling, (2024). https://doi.org/10.48550/arXiv.2405.17247

  16. [16]

    B. Yang, B. Zhang, Y. Han, B. Liu, J. Hu, Y. Jin, Vision transformer -based visual language understanding of the construction process, Alexandria Engineering Journal 99 (2024) 242–256. https://doi.org/10.1016/j.aej.2024.05.015

  17. [17]

    Z. Chen, H. Chen, M. Imani, R. Chen, F. Imani, Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces, Expert Systems with Applications 265 (2025) 125769. https://doi.org/10.1016/j.eswa.2024.125769

  18. [18]

    Tsai, J.J

    W.L. Tsai, J.J. Lin, S.-H. Hsieh, Generating Construction Safety Observations via CLIP- Based Image -Language Embedding, in: L. Karlinsky, T. Michaeli, K. Nishino (Eds.), Computer Vision – ECCV 2022 Workshops, Springer Nature Switzerland, Cham, 2023: pp. 366–381. https://doi.org/10.1007/978-3-031-25082-8_24

  19. [19]

    Q. Chen, X. Yin, Tailored Vision -Language Framework for Automated Hazard Identification And Report Generation in Construction Sites, (2025). https://doi.org/10.2139/ssrn.5137949

  20. [20]

    M. Adil, G. Lee, V.A. Gonzalez, Q. Mei, Using Vision Language Models for Safety Hazard Identification in Construction, (2025). https://doi.org/10.48550/arXiv.2504.09083

  21. [21]

    Chan, P.K.-Y

    J.C.F. Chan, P.K.-Y. Wong, J.C.P. Cheng, X. Guo, J.P. -C. Chan, P. -H. Leung, X. Tao, Context-Aware Vision-Language Model Agent Enriched with Domain-Specific Ontology for Construction Site Safety Monitoring, (2025). https://doi.org/10.2139/ssrn.5080079

  22. [22]

    Sharshar, L.U

    A. Sharshar, L.U. Khan, W. Ullah, M. Guizani, Vision -Language Models for Edge Networks: A Comprehensive Survey, IEEE Internet Things J. 12 (2025) 32701 –32724. https://doi.org/10.1109/JIOT.2025.3579032

  23. [23]

    Rezazadeh Azar, B

    E. Rezazadeh Azar, B. McCabe, Automated Visual Recognition of Dump Trucks in Construction Videos, Journal of Computing in Civil Engineering 26 (2012) 769 –781. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000179

  24. [24]

    Tajeen, Z

    H. Tajeen, Z. Zhu, Image dataset development for measuring construction equipment recognition performance, Automation in Construction 48 (2014) 1 –10. https://doi.org/10.1016/j.autcon.2014.07.006

  25. [25]

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. -Y. Fu, A.C. Berg, SSD: Single Shot MultiBox Detector, in: 2016: pp. 21 –37. https://doi.org/10.1007/978-3-319-46448- 0_2

  26. [26]

    You Only Look Once: Unified, Real-Time Object Detection

    J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real - Time Object Detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: pp. 779–788. https://doi.org/10.1109/CVPR.2016.91

  27. [27]

    S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 1137–

  28. [28]

    https://doi.org/10.1109/TPAMI.2016.2577031

  29. [29]

    J. Wu, N. Cai, W. Chen, H. Wang, G. Wang, Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset, Automation in Construction 106 (2019) 102894. https://doi.org/10.1016/j.autcon.2019.102894

  30. [30]

    D. Kim, M. Liu, S. Lee, V.R. Kamat, Remote proximity monitoring between mobile construction resources using camera -mounted UAVs, Automation in Construction 99 (2019) 168–182. https://doi.org/10.1016/j.autcon.2018.12.014

  31. [31]

    H. Son, C. Kim, Integrated worker detection and tracking for the safe operation of construction machinery, Automation in Construction 126 (2021) 103670. https://doi.org/10.1016/j.autcon.2021.103670. 29

  32. [32]

    Zhang, D

    J. Zhang, D. Zhang, X. Liu, R. Liu, G. Zhong, A Framework of on -site Construction Safety Management Using Computer Vision and Real -Time Location System, in: International Conference on Smart Infrastructure and Construction 2019 (ICSIC), ICE Publishing, 2019: pp. 327–333. https://doi.org/10.1680/icsic.64669.327

  33. [33]

    Chian, W

    E. Chian, W. Fang, Y.M. Goh, J. Tian, Computer vision approaches for detecting missing barricades, Automation in Construction 131 (2021) 103862. https://doi.org/10.1016/j.autcon.2021.103862

  34. [34]

    Z. Xu, J. Huang, K. Huang, A novel computer vision -based approach for monitoring safety harness use in construction, IET Image Processing 17 (2023) 1071 –1085. https://doi.org/10.1049/ipr2.12696

  35. [35]

    Ann, K.Y

    H. Ann, K.Y. Koo, Deep Learning Based Fire Risk Detection on Construction Sites, Sensors 23 (2023) 9095. https://doi.org/10.3390/s23229095

  36. [36]

    https://www.ibm.com/think/topics/vision-language-models (accessed March 8, 2026)

    What Are Vision Language Models (VLMs)? | IBM, (2025). https://www.ibm.com/think/topics/vision-language-models (accessed March 8, 2026)

  37. [37]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, (2021). https://doi.org/10.48550/arXiv.2103.00020

  38. [38]

    Flamingo: a Visual Language Model for Few-Shot Learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M . Binkowski, R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a Visual Language Model fo...

  39. [39]

    J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping Language -Image Pre -training for Unified Vision -Language Understanding and Generation, (2022). https://doi.org/10.48550/arXiv.2201.12086

  40. [40]

    H. Liu, C. Li, Y. Li, Y.J. Lee, Improved Baselines with Visual Instruction Tuning, (2024). https://doi.org/10.48550/arXiv.2310.03744

  41. [41]

    GPT-4 Technical Report

    OpenAI, S. Adler, S. Agarwal, et al. , GPT -4 Technical Report, (2024). http://arxiv.org/abs/2303.08774 (accessed December 19, 2024)

  42. [42]

    G. Team, P. Georgiev, et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, (2024). https://doi.org/10.48550/arXiv.2403.05530

  43. [43]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, et al. , The Llama 3 Herd of Models, (2024). https://doi.org/10.48550/arXiv.2407.21783

  44. [44]

    A. Yang, A. Li, et al. , Qwen3 Technical Report, (2025). https://doi.org/10.48550/arXiv.2505.09388

  45. [45]

    J. Zhu, W. Wang, et al., InternVL3: Exploring Advanced Training and Test-Time Recipes for Open -Source Multimodal Models, (2025). https://doi.org/10.48550/arXiv.2504.10479

  46. [46]

    Radford, J.W

    A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, in: Proceedings of the 38th International Confere nce on Machine Learning, PMLR, 2021: pp. 8748 –8763. https://proceedings.mlr.press/v139/radford21a.ht...

  47. [47]

    Y. Ding, M. Liu, X. Luo, Safety compliance checking of construction behaviors using visual question answering, Automation in Construction 144 (2022) 104580. https://doi.org/10.1016/j.autcon.2022.104580

  48. [48]

    C. Fan, Q. Mei, X. Wang, X. Li, ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers, (2024). https://doi.org/10.48550/arXiv.2412.19954. 30

  49. [50]

    Li, B.C.M

    M.Q. Li, B.C.M. Fung, Security concerns for Large Language Models: A survey, Journal of Information Security and Applications 95 (2025) 104284. https://doi.org/10.1016/j.jisa.2025.104284

  50. [51]

    A. Rahman, A systematic review of vision language models: Comprehensive analysis of architectures, applications, datasets and challenges towards robust multimodal intelligence, Array (2026) 100739. https://doi.org/10.1016/j.array.2026.100739

  51. [52]

    Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, and 1 others

    G. Shinde, A. Ravi, E. Dey, S. Sakib, M. Rampure, N. Roy, A Survey on Efficient Vision- Language Models, (2025). https://doi.org/10.48550/arXiv.2504.09724

  52. [53]

    G. Team, A. Kamath, et al. Gemma 3 Technical Report, (2025). https://doi.org/10.48550/arXiv.2503.19786

  53. [54]

    SmolVLM: Redefining small and efficient multimodal models

    A. Marafioti, O. Zohar, M. Farré , M. Noyan, E. Bakouch, P. Cuenca, C. Zakka, L.B. Allal, A. Lozhkov, N. Tazi, V. Srivastav, J. Lochner, H. Larcher, M. Morlon, L. Tunstall, L. von Werra, T. Wolf, SmolVLM: Redefining small and efficient multimodal models, (2025). https://doi.org/10.48550/arXiv.2504.05299

  54. [55]

    Patnaik, N

    N. Patnaik, N. Nayak, H.B. Agrawal, M.C. Khamaru, G. Bal, S.S. Panda, R. Raj, V. Meena, K. Vadlamani, Small Vision -Language Models: A Survey on Compact Architectures and Techniques, (2025). https://doi.org/10.48550/arXiv.2503.10665

  55. [56]

    https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm } }, url = { https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm (accessed December 9, 2024)

    Roboflow, Safety Inspection Dataset, (n.d.). https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm } }, url = { https://universe.roboflow.com/vest - qibko/safety-inspection-kcbsm (accessed December 9, 2024)

  56. [57]

    Xiao, S.-C

    B. Xiao, S.-C. Kang, Development of an Image Data Set of Construction Machines for Deep Learning Object Detection, Journal of Computing in Civil Engineering 35 (2021) 05020005. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000945

  57. [58]

    X. Chen, Z. Zou, Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?, (2025). https://doi.org/10.48550/arXiv.2508.11011