Recognition: 2 theorem links
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3
The pith
Integrating object detection with small vision-language models boosts construction hazard detection performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A detection-guided sVLM framework first applies YOLOv11n to localize workers and construction machinery, then uses these localizations to build structured prompts that guide sVLM reasoning, yielding improved hazard identification and explanatory rationales on construction-site images.
What carries the argument
Structured prompting based on YOLOv11n detections, which provides spatial context to small VLMs for more accurate multimodal hazard assessment.
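The detection-to-prompt step can be sketched as follows. The paper does not publish its prompt template, so the serialization format, field names, and wording below are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch of detection-guided prompt construction.
# The paper does not publish its template; this format is assumed.

def build_prompt(detections):
    """Serialize detector output into a structured text prompt for an sVLM.

    detections: list of (label, confidence, (x1, y1, x2, y2)) tuples,
    e.g. as produced by a YOLO-style detector.
    """
    lines = ["Detected entities (label, confidence, bounding box):"]
    for label, conf, (x1, y1, x2, y2) in detections:
        lines.append(f"- {label} (conf {conf:.2f}) at [{x1}, {y1}, {x2}, {y2}]")
    lines.append(
        "Given these entities and the image, identify any safety hazards "
        "involving workers and machinery, and explain your reasoning."
    )
    return "\n".join(lines)

prompt = build_prompt([
    ("worker", 0.91, (120, 80, 180, 240)),
    ("excavator", 0.87, (300, 60, 560, 310)),
])
```

The prompt text would then be passed to the sVLM together with the image; only the serialization step is shown here.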
If this is right
- Hazard detection F1-scores improve for every tested small VLM.
- Explanation quality, measured by BERTScore, increases from 0.61 to 0.82 for the best model.
- Inference overhead is minimal at 2.5 milliseconds per image.
- The framework enables efficient context-aware safety monitoring in construction environments.
- Zero-shot performance gains demonstrate the value of detection guidance without model fine-tuning.
Where Pith is reading between the lines
- This technique could be applied to other visual safety tasks, such as identifying hazards in manufacturing or warehouse settings.
- Future experiments might test the framework on live video streams to assess real-time performance.
- The approach may reduce hallucinations in sVLMs by anchoring reasoning to detected objects.
- Combining this with larger models or additional sensors could further enhance reliability in variable site conditions.
Load-bearing premise
The assumption that YOLO detections are accurate enough, and that embedding them in prompts will consistently improve sVLM performance without introducing new errors or missing hazards outside the detected categories.
What would settle it
Running the baseline and guided versions on a new collection of construction images featuring hazards in scenes with poor object detection results, such as occluded workers or unusual equipment, and observing no consistent improvement in F1-score or explanation quality.
read the original abstract
Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes integrating YOLOv11n object detection outputs into structured prompts to guide small VLMs (<4B parameters) for zero-shot construction-site hazard identification and rationale generation. On a curated dataset of annotated construction images, the detection-guided approach yields consistent F1 gains across six sVLMs (e.g., Gemma-3 4B: 50.6% vs. 34.5% baseline) and improved explanation quality (BERTScore F1 0.82 vs. 0.61), while adding only 2.5 ms per image.
Significance. If the empirical gains prove robust, the work demonstrates a lightweight, deployable route to context-aware hazard reasoning that balances the efficiency of sVLMs against the accuracy of larger models. The negligible overhead and cross-model consistency suggest practical value for real-time safety monitoring; the prompting technique may also transfer to other grounded reasoning domains.
major comments (3)
- [Abstract] Abstract and evaluation section: performance claims rest on a 'curated dataset of construction site images with hazard annotations and explanatory rationales,' yet no cardinality, diversity statistics (lighting, occlusion, camera angles, hazard co-occurrence), inter-annotator agreement, or train/test split details are supplied. Without these, the representativeness assumption cannot be assessed and the reported F1/BERTScore deltas remain unverifiable.
- [Framework / Experiments] Framework description and experiments: the central claim that embedding YOLO detections 'reliably improve sVLM reasoning without introducing new failure modes' is untested. No ablation on imperfect detections (misses, false positives, or localization noise) is presented, leaving open whether prompt construction amplifies YOLO errors in realistic scenes.
- [Results] Results: the F1 (50.6% vs 34.5%) and BERTScore (0.82 vs 0.61) improvements are stated without error bars, confidence intervals, or statistical significance tests. This weakens the assertion of 'consistent' improvement across all six models.
minor comments (2)
- [Methods] Clarify the exact zero-shot prompt templates and how detected bounding boxes are serialized into text; an example in the methods would aid reproducibility.
- [Experiments] The overhead measurement (2.5 ms) should specify the hardware and whether it includes YOLO inference or only the added prompt construction step.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of clarity, robustness, and statistical rigor that we will address in the revision. Below we respond point by point to the major comments.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation section: performance claims rest on a 'curated dataset of construction site images with hazard annotations and explanatory rationales,' yet no cardinality, diversity statistics (lighting, occlusion, camera angles, hazard co-occurrence), inter-annotator agreement, or train/test split details are supplied. Without these, the representativeness assumption cannot be assessed and the reported F1/BERTScore deltas remain unverifiable.
Authors: We agree that these dataset characteristics are necessary to allow readers to evaluate representativeness and reproducibility. In the revised manuscript we will expand the Experiments section with the total number of images in the curated dataset, quantitative diversity statistics covering lighting conditions, occlusion levels, camera angles, and hazard co-occurrence frequencies, inter-annotator agreement metrics, and explicit details on the train/test split. revision: yes
-
Referee: [Framework / Experiments] Framework description and experiments: the central claim that embedding YOLO detections 'reliably improve sVLM reasoning without introducing new failure modes' is untested. No ablation on imperfect detections (misses, false positives, or localization noise) is presented, leaving open whether prompt construction amplifies YOLO errors in realistic scenes.
Authors: We acknowledge that the manuscript does not contain an explicit ablation on detection errors. While the observed gains were consistent across six independent sVLMs, this does not directly test error propagation. In the revision we will add a dedicated limitations paragraph discussing potential amplification of YOLO misses or false positives and will include a controlled ablation that injects realistic detection noise into the prompts to quantify sensitivity. revision: yes
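The ablation the authors promise could inject synthetic detection errors before prompt construction. A minimal sketch, assuming the (label, confidence, box) detection format used above; the drop probability and jitter magnitude are arbitrary illustration values, not the paper's.

```python
import random

def perturb_detections(detections, drop_p=0.2, jitter=10, seed=0):
    """Simulate imperfect detector output: randomly drop boxes (misses)
    and jitter box coordinates (localization noise).

    Illustrative only; the paper reports no such ablation.
    """
    rng = random.Random(seed)
    noisy = []
    for label, conf, box in detections:
        if rng.random() < drop_p:
            continue  # simulated miss
        jittered = tuple(c + rng.randint(-jitter, jitter) for c in box)
        noisy.append((label, conf, jittered))
    return noisy

clean = [("worker", 0.9, (10, 10, 50, 90)),
         ("excavator", 0.8, (100, 40, 220, 160))]
noisy = perturb_detections(clean)
```

Comparing guided F1 on clean versus perturbed prompts would quantify how much YOLO error propagates into sVLM hazard judgments.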
-
Referee: [Results] Results: the F1 (50.6% vs 34.5%) and BERTScore (0.82 vs 0.61) improvements are stated without error bars, confidence intervals, or statistical significance tests. This weakens the assertion of 'consistent' improvement across all six models.
Authors: We agree that the absence of variability measures and significance testing limits the strength of the consistency claim. In the revised Results section we will report standard deviations or bootstrap confidence intervals for each metric and will add paired statistical tests (e.g., McNemar or Wilcoxon signed-rank) across the six models to substantiate the reported improvements. revision: yes
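A percentile bootstrap over per-image paired differences is one way to produce the promised intervals. The sketch below uses synthetic per-image scores; it is not the authors' analysis, and in practice the paired tests mentioned above would come from a statistics library such as scipy.stats.

```python
import random

def bootstrap_ci_diff(baseline, guided, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference
    (guided - baseline) of a per-image metric."""
    rng = random.Random(seed)
    diffs = [g - b for b, g in zip(baseline, guided)]
    n = len(diffs)
    # Resample paired differences with replacement, n_boot times.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Synthetic per-image scores for illustration only.
baseline = [0.30] * 20
guided = [0.40] * 20
lo, hi = bootstrap_ci_diff(baseline, guided)
```

An interval excluding zero would support the consistency claim; reporting it per model would address the referee's point directly.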
Circularity Check
No circularity; results are direct empirical measurements on held-out data
full rationale
The paper describes a pipeline that runs YOLOv11n detection, inserts bounding-box information into fixed structured prompts, and then evaluates six sVLMs in zero-shot mode on a separately curated construction-image dataset. All reported numbers (F1 50.6% vs 34.5%, BERTScore 0.82 vs 0.61) are obtained by executing this pipeline on the test split and computing standard metrics; no parameters are fitted to the evaluation set, no quantity is defined in terms of itself, and no self-citation supplies a uniqueness theorem or ansatz on which the central claim depends. The derivation chain is therefore a standard experimental protocol whose conclusions do not presuppose the claim under test.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained small VLMs can leverage spatially structured prompts to improve zero-shot hazard reasoning on construction scenes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "The framework first employs a YOLOv11n detector to localize workers and construction machinery... embedded into structured prompts to guide the reasoning process of sVLMs"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "six sVLMs... evaluated in zero-shot settings on a curated dataset of construction site images"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.