A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

Imad Gohar; Malikussaid

arxiv: 2605.26533 · v1 · pith:USCW3EXYnew · submitted 2026-05-26 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-06-29 18:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG

keywords industrial inspection · defect detection · report generation · vision-language model · QLoRA · wind turbine · hybrid architecture · structured JSON output

The pith

A three-component pipeline with a 1.5B adapted model generates higher-quality defect reports than a 671B generalist vision-language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a decoupled system for wind turbine blade inspection that separates defect localization from report generation. A YOLO detector finds oriented bounding boxes at native resolution. A parameter-free bridge turns those boxes into grid-referenced tokens inside a structured prompt. A QLoRA-adapted 1.5B model then produces a JSON maintenance report, with retrieval-augmented fine-tuning to ground recommendations in procedures. The full pipeline scores BLEU-4 0.41, hallucination rate 4 percent, and expert score 8.6 out of 10, versus 0.07, 65 percent, and 3.3 for a zero-shot large VLM baseline. The same small model also outperforms a 671B-parameter generalist model when both receive identical detection evidence, while running at 47 tokens per second on a single T4 GPU.

Core claim

The decoupled Eyes-Bridge-Brain pipeline, with a 4-bit quantized Qwen-2.5-1.5B model adapted via QLoRA on 947 synthetic reports and RAFT for procedure grounding, produces structured JSON reports that achieve BLEU-4 of 0.41, hallucination rate of 4 percent, and expert score of 8.6/10, exceeding both zero-shot VLM baselines and a 671B generalist model given the same detection input.

What carries the argument

The three-part pipeline: YOLO26-x-obb detector for oriented bounding boxes, deterministic Bridge module that encodes boxes into grid-referenced spatial tokens, and QLoRA-adapted 1.5B LLM that converts the prompt into a structured JSON report.
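A minimal sketch of what a deterministic, parameter-free Bridge could look like. The abstract does not specify the grid layout or token format, so the 3x3 grid, the cell names (`A1` to `C3`), the `OrientedBox` fields, and the prompt wording below are all assumptions made for illustration, not the authors' encoding.

```python
from dataclasses import dataclass


@dataclass
class OrientedBox:
    """Hypothetical detector output: one oriented bounding box."""
    label: str        # defect class name
    cx: float         # box centre x in pixels
    cy: float         # box centre y in pixels
    w: float          # box width in pixels
    h: float          # box height in pixels
    angle_deg: float  # rotation in degrees
    conf: float       # detector confidence in [0, 1]


def grid_cell(cx, cy, img_w, img_h, rows=3, cols=3):
    """Map a box centre to a grid cell name such as 'B2' (row letter, column number)."""
    col = min(int(cx / img_w * cols), cols - 1)  # clamp centres on the right edge
    row = min(int(cy / img_h * rows), rows - 1)  # clamp centres on the bottom edge
    return f"{chr(ord('A') + row)}{col + 1}"


def bridge_prompt(boxes, img_w, img_h):
    """Turn detections into a structured text prompt with no learned parameters."""
    lines = ["Detected defects:"]
    for i, b in enumerate(boxes, 1):
        cell = grid_cell(b.cx, b.cy, img_w, img_h)
        lines.append(
            f"{i}. <{b.label}> <cell:{cell}> "
            f"<angle:{round(b.angle_deg)}> <conf:{b.conf:.2f}>"
        )
    lines.append("Write a JSON maintenance report for these defects.")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [OrientedBox("crack", 450, 450, 120, 30, 12.4, 0.87)]
    print(bridge_prompt(demo, img_w=900, img_h=900))
```

Because the mapping is a pure function of the detections, the same boxes always yield the same prompt, which is the property the summary credits for keeping the system simple and reproducible.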

If this is right

The complete pipeline runs at 47 tokens per second on a single T4-class GPU, enabling edge deployment.
Ablation results show that removing any one component increases hallucination rate and lowers expert scores.
The 1.5B QLoRA model produces higher-quality reports than the 671B generalist model when both receive identical detection evidence.
Retrieval-augmented fine-tuning grounds recommendations in indexed maintenance procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The deterministic Bridge encoding of spatial tokens could be added to other vision-language models to reduce spatial hallucinations without retraining the full model.
If synthetic report generation can be scaled to new industrial domains, the same small-model adaptation approach may apply beyond wind turbines.
The performance gap between the adapted 1.5B model and the 671B baseline suggests that task-specific structure and domain data matter more than raw parameter count for structured output tasks.

Load-bearing premise

The 947 synthetically generated maintenance reports represent real-world scenarios and the LLM-as-a-Judge scores align with actual expert judgment.

What would settle it

Run the pipeline on a held-out set of real expert-written maintenance reports from actual wind turbine inspections and measure whether BLEU-4, hallucination rate, and expert scores remain higher than the large VLM baseline.
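To make the hallucination-rate part of such a test concrete, here is one possible operationalisation. The paper's exact definition of Hallucination Rate is not given in the abstract; this sketch assumes a report counts as hallucinated if it names any (defect type, grid cell) pair absent from the detection evidence, or if it is not a parseable JSON object. The field names `defects`, `type`, and `cell` are hypothetical.

```python
import json


def is_hallucinated(report_json, evidence):
    """Return True if the report mentions a defect not backed by detection evidence.

    evidence is a set of (defect_type, grid_cell) pairs produced by the detector.
    """
    try:
        report = json.loads(report_json)
    except json.JSONDecodeError:
        return True  # assumption: unparseable output is counted as a failure
    if not isinstance(report, dict):
        return True  # assumption: the report must be a JSON object
    for defect in report.get("defects", []):
        if (defect.get("type"), defect.get("cell")) not in evidence:
            return True
    return False


def hallucination_rate(reports, evidences):
    """Fraction of reports flagged as hallucinated (0.0 for an empty input)."""
    flags = [is_hallucinated(r, e) for r, e in zip(reports, evidences)]
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    evidence = {("crack", "B2")}
    grounded = json.dumps({"defects": [{"type": "crack", "cell": "B2"}]})
    invented = json.dumps({"defects": [{"type": "erosion", "cell": "A1"}]})
    print(hallucination_rate([grounded, invented], [evidence, evidence]))
```

A report-level rate like this is strict: one unsupported defect flags the whole report. A per-claim rate would be a reasonable alternative if the paper's metric is defined that way.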

Original abstract

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 = 0.41, HR = 4%, and Expert Score = 8.6/10, compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
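For readers unfamiliar with the headline metric, the following is a minimal sketch of sentence-level BLEU-4: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It assumes whitespace tokenisation, a single reference, and no smoothing; the paper's actual BLEU tooling and settings are not stated in the abstract and may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against a single reference string."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precision += 0.25 * math.log(clipped / total)
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


if __name__ == "__main__":
    print(bleu4("the crack is on the blade edge",
                "the crack is on the leading edge"))
```

Since BLEU rewards n-gram overlap with the reference, a model tuned on template-generated reports can score well partly by matching template phrasing, which is relevant to the concerns about the synthetic corpus raised later on this page.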

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main result is a small QLoRA-adapted 1.5B model plus deterministic bridge beating a 671B VLM on synthetic report metrics, but the synthetic data and LLM judge leave the real-world claim unproven.

The core takeaway is that a decoupled pipeline (YOLO-OBB detection, a parameter-free spatial token bridge, and a 1.5B Qwen model with QLoRA plus RAFT) produces higher BLEU-4, lower hallucination, and better LLM-judge scores than a zero-shot large VLM, and also beats a 671B generalist model when both receive the same detection input. The architecture is edge-friendly at 47 tokens per second on a T4, and the ablations show that each piece contributes.

What stands out is the explicit separation of localization, spatial grounding, and language generation, plus the use of a tiny domain-adapted model instead of scaling up. The deterministic bridge avoids learned spatial encoders, which keeps the system simple and reproducible. The numbers are concrete and the comparison to the large model is direct.

The soft spot is the evaluation foundation. All training and scoring rest on 947 synthetically generated reports with no reported procedure for how they were made, no diversity stats, and no human expert validation against actual maintenance language. The LLM-as-Judge rubric is used without details on its calibration or agreement with real inspectors. If the synthetic corpus does not match the lexical and recommendation patterns of real wind-turbine reports, the adaptation and the judge scores mainly confirm fit to an artificial distribution rather than genuine improvement on the task.
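One way the missing judge validation could be checked, sketched here under assumptions: collect scores from the LLM judge and from human inspectors on the same reports, then compute a rank correlation. Spearman's coefficient is this note's choice of statistic, not something the paper reports, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant input has undefined correlation")
    return cov / math.sqrt(vx * vy)


def spearman(judge_scores, expert_scores):
    """Spearman rank correlation between LLM-judge and human expert scores."""
    return pearson(average_ranks(judge_scores), average_ranks(expert_scores))
```

A coefficient near 1 would indicate that the judge orders reports the way inspectors do; it would not by itself show that the absolute 0 to 10 scores are calibrated.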

This is for researchers building practical inspection systems who want to see whether small adapted models can replace end-to-end giants on structured output. It is worth sending to peer review because the architecture is clearly described, the ablations are present, and the performance gap is large enough to test; reviewers will need to press on the data generation and judge validation, but the work is coherent enough to merit that step.

Referee Report

2 major / 0 minor

Summary. The paper proposes a decoupled three-component pipeline for wind-turbine blade defect inspection and structured report generation: a YOLO26-x-obb detector (Eyes) for oriented bounding-box localization, a deterministic parameter-free Bridge module that encodes detections into grid-referenced spatial tokens, and a 4-bit QLoRA-adapted Qwen-2.5-1.5B model (Brain) fine-tuned on 947 synthetically generated maintenance reports plus RAFT, which produces JSON reports. Five ablations using BLEU-4, ROUGE-L, Hallucination Rate, and LLM-as-Judge scoring show the full pipeline reaching BLEU-4 0.41 / HR=4% / Expert Score 8.6/10 versus 0.07 / 65% / 3.3/10 for a zero-shot VLM baseline and outperforming a 671B generalist model at 47 tokens/s on a T4 GPU.

Significance. If the central claims hold after validation, the work would demonstrate that a purpose-built, edge-deployable decoupled architecture with modest domain-specific adaptation can exceed both zero-shot VLMs and much larger generalist models on structured industrial report generation. The quantitative ablation suite and explicit comparison to a 671B baseline provide concrete evidence for the value of task decomposition and small-model specialization in this domain.

major comments (2)

[Abstract] Abstract (final two paragraphs): The reported superiority (BLEU-4 0.41, HR=4%, Expert Score 8.6/10) and the claim that the QLoRA-adapted 1.5B model outperforms the 671B generalist model rest entirely on adaptation and evaluation using 947 synthetically generated reports together with an unvalidated LLM-as-Judge rubric. No generation procedure, lexical/structural diversity statistics, or human-expert alignment study for these reports is supplied, which directly undermines the generalization argument that the pipeline performs better on the actual maintenance task.
[Abstract] Abstract (ablation description): The five ablation experiments compare the complete pipeline against a monolithic VLM baseline and partial configurations, yet the evaluation remains confined to held-out synthetic reports. Without evidence that the synthetic corpus reproduces the lexical, recommendation, and variability distributions of real wind-turbine maintenance reports, the ablation results cannot establish that the decoupled design improves real-world defect reasoning.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and the focus on the synthetic data foundation. We address each major comment below. Where the manuscript is incomplete we propose targeted additions; where new empirical validation would be required we note the limitation explicitly.

Referee: [Abstract] Abstract (final two paragraphs): The reported superiority (BLEU-4 0.41, HR=4%, Expert Score 8.6/10) and the claim that the QLoRA-adapted 1.5B model outperforms the 671B generalist model rest entirely on adaptation and evaluation using 947 synthetically generated reports together with an unvalidated LLM-as-Judge rubric. No generation procedure, lexical/structural diversity statistics, or human-expert alignment study for these reports is supplied, which directly undermines the generalization argument that the pipeline performs better on the actual maintenance task.

Authors: We agree that the generation procedure and diversity statistics must be supplied. The 947 reports were produced by a deterministic template engine seeded with defect taxonomies and maintenance-action lists drawn from our industrial partner’s historical logs; each template was then varied by sampling from a small set of lexical paraphrases and recommendation phrasings. We will add a new subsection (3.3) that documents the template grammar, the sampling procedure, and quantitative diversity measures (type-token ratio, n-gram entropy, and structural variance across JSON fields). We also acknowledge that no separate human-expert alignment study was performed; this is a genuine limitation of the current study and will be stated as such in the revised Limitations paragraph. revision: partial
Referee: [Abstract] Abstract (ablation description): The five ablation experiments compare the complete pipeline against a monolithic VLM baseline and partial configurations, yet the evaluation remains confined to held-out synthetic reports. Without evidence that the synthetic corpus reproduces the lexical, recommendation, and variability distributions of real wind-turbine maintenance reports, the ablation results cannot establish that the decoupled design improves real-world defect reasoning.

Authors: The ablations are performed on held-out synthetic reports by design, because the controlled corpus lets us isolate the contribution of each pipeline stage without confounding factors from real-world annotation noise. We will expand the Data section to include a side-by-side comparison of key statistics (average report length, frequency of each defect class, distribution of recommendation verbs) between the synthetic corpus and a small set of redacted real maintenance reports that our partner permitted us to inspect. We cannot, however, release or evaluate on a large public real-report corpus; therefore the claim that the architecture improves real-world performance rests on the assumption that the synthetic distribution is sufficiently representative—an assumption we will now qualify in the text. revision: partial
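Two of the diversity measures named in the first response, type-token ratio and n-gram entropy, can be sketched as follows. This assumes lowercased whitespace tokenisation and pools all reports into one corpus; the simulated authors do not say how they would tokenise or aggregate, so treat these choices as illustrative.

```python
import math
from collections import Counter


def type_token_ratio(texts):
    """Distinct tokens divided by total tokens across the corpus."""
    tokens = [t for text in texts for t in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ngram_entropy(texts, n=2):
    """Shannon entropy in bits of the corpus-level n-gram distribution."""
    counts = Counter()
    for text in texts:
        toks = text.lower().split()
        # n-grams are counted within each report, never across report boundaries.
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    corpus = ["Crack detected near blade tip", "Crack detected near blade root"]
    print(type_token_ratio(corpus), ngram_entropy(corpus, n=2))
```

Low values on a template-generated corpus would support the referee's concern that the 947 reports cover a narrow slice of maintenance language; computing the same numbers on the redacted real reports would give the side-by-side comparison promised in the second response.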

standing simulated objections not resolved

A full human-expert alignment study comparing synthetic versus real reports cannot be supplied without additional data access and annotation resources that are outside the scope of the present work.

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on held-out data

The paper presents a decoupled pipeline evaluated via standard metrics (BLEU-4, ROUGE-L, HR, LLM-as-Judge) on held-out synthetic reports after QLoRA adaptation, with ablations against baselines. No load-bearing step reduces by construction to its inputs, no self-definitional mappings, no fitted parameters renamed as predictions, and no self-citation chains invoked as uniqueness theorems. The central claims rest on independent held-out comparisons rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of synthetic data and the deterministic bridge module, which are not independently verified in the provided abstract. No explicit free parameters or invented entities are described.

axioms (1)

domain assumption: Synthetic maintenance reports can train a model to produce accurate real reports.
The training relies on 947 synthetic reports without mention of validation against real data.

pith-pipeline@v0.9.1-grok · 5864 in / 1277 out tokens · 37317 ms · 2026-06-29T18:47:39.066891+00:00 · methodology


[1] [1]

The impact of individual head-related transfer function augmen- tation on spatial release from masking,

Zhong, D., Xia, Z., Zhu, Y., Duan, J.: Overview of predictive maintenance based on digital twin technology. Heliyon9(4), 14534 (2023) https://doi.org/10.1016/j. heliyon.2023.e14534

work page doi:10.1016/j 2023

[2] [2]

Future Internet17(11), 528 (2025) https://doi.org/10.3390/fi17110528

Hamdi, A., Noura, H.N.: Ai-driven damage detection in wind turbines: Drone imagery and lightweight deep learning approaches. Future Internet17(11), 528 (2025) https://doi.org/10.3390/fi17110528

work page doi:10.3390/fi17110528 2025

[3] [3]

Mea- surement Science and Technology36(9), 095416 (2025) https://doi.org/10.1088/ 1361-6501/ae08db

Si, Y., Ding, Y., Ge, F., Wu, X., Liu, J., Ding, D., Zhang, H.: A multi-scale defect detection network for wind turbines utilizing margin aware features. Mea- surement Science and Technology36(9), 095416 (2025) https://doi.org/10.1088/ 1361-6501/ae08db

2025

[4]

Zheng, B., Angkawisittpan, N., Huang, L., Sonasang, S.: An improved yolov11n algorithm with conv2former and pw-iou for uav inspection of power line insulators. Engineering, Technology & Applied Science Research 15(6), 30267–30276 (2025) https://doi.org/10.48084/etasr.14220

[5]

Deng, Z., Li, X., Yang, R.: Rml-yolo: An insulator defect detection method for uav aerial images. Applied Sciences 15(11), 6117 (2025) https://doi.org/10.3390/app15116117

[6]

Gu, Z., Zhu, B., Zhu, G., Chen, Y., Tang, M., Wang, J.: Anomalygpt: Detecting industrial anomalies using large vision-language models. Proceedings of the AAAI Conference on Artificial Intelligence 38(3), 1932–1940 (2024) https://doi.org/10.1609/aaai.v38i3.27963

[7]

Cai, W., Huang, W., Cao, Y., Huang, C., Yuan, F., Zhang, B., Wen, J.: Towards vlm-based hybrid explainable prompt enhancement for zero-shot industrial anomaly detection. In: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pp. 711–719. International Joint Conferences on Artificial Intelligence Organization (2025). ...

[8]

Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5625–5644 (2024) https://doi.org/10.1109/TPAMI.2024.3369699

[9]

Yang, M., Wang, Z.: Image synthesis under limited data: A survey and taxonomy. International Journal of Computer Vision 133(6), 3689–3726 (2025) https://doi.org/10.1007/s11263-025-02357-y

[10]

Bai, Y., Zhang, J., Dong, Y., Cao, Y., Tian, G.: Dual-Path Frequency Discriminators for Few-Shot Anomaly Detection (2024). https://doi.org/10.2139/ssrn.4862099

[11]

Ultralytics: Ultralytics YOLO26. https://docs.ultralytics.com/models/yolo26/ (2026)

[12]

Sapkota, R., Cheppally, R.H., Sharda, A., Karkee, M.: YOLO26: Key architectural enhancements and performance benchmarking for real-time object detection. arXiv preprint arXiv:2509.25164 (2025)

[13]

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) https://doi.org/10.1023/B:VISI.0000029664.99615.94

[14]

Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2017) https://doi.org/10.1109/TPAMI.2016.2577031

[15]

Ren, Z., Fang, F., Yan, N., Wu, Y.: State of the art in defect detection based on machine vision. International Journal of Precision Engineering and Manufacturing-Green Technology 9(2), 661–691 (2022) https://doi.org/10.1007/s40684-021-00343-6

[16]

Tulbure, A.-A., Tulbure, A.-A., Dulf, E.-H.: A review on modern defect detection models using dcnns – deep convolutional neural networks. Journal of Advanced Research 35, 33–48 (2022) https://doi.org/10.1016/j.jare.2021.03.015

[17]

Zhao, B., Li, X., Wang, G., Gao, H., Lv, C., Cao, S.: End-to-end wind turbine damage detection model based on multi-branch feature sensing and contextual information reuse in harsh environments. Renewable Energy 253, 123489 (2025) https://doi.org/10.1016/j.renene.2025.123489

[18]

Chen, J., Liu, Z., Wang, H., Nunez, A., Han, Z.: Automatic defect detection of fasteners on the catenary support device using deep convolutional neural network. IEEE Transactions on Instrumentation and Measurement 67(2), 257–269 (2018) https://doi.org/10.1109/TIM.2017.2775345

[19]

Liu, S., Zhang, W., Yuan, S., Bao, H., Mao, W., Xi, S.: A lightweight model for insulator defect detection based on vision–language modeling and prior knowledge in power systems. Processes 13(11), 3714 (2025) https://doi.org/10.3390/pr13113714

[20]

Tran, N.-Q., Nguyen, H.-C., Mach, B.-N., Nguyen, N.N., Nguyen, T.Q.: Mobilevit-slm: real-time edge-deployable cnn–transformer hybrid for fine-grained scan line defect classification in additive manufacturing. Journal of Intelligent Manufacturing (2025) https://doi.org/10.1007/s10845-025-02767-2

[21]

Dwivedi, D., Babu, K.V.S.M., Yemula, P.K., Chakraborty, P., Pal, M.: Identification of surface defects on solar pv panels and wind turbine blades using attention based deep learning model. Engineering Applications of Artificial Intelligence 131, 107836 (2024) https://doi.org/10.1016/j.engappai.2023.107836

[22]

Jiang, Y., Lu, X., Jin, Q., Sun, Q., Wu, H., Zhuo, C.: Fabgpt: An efficient large multimodal model for complex wafer defect knowledge queries. In: Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, pp. 1–8. ACM (2024). https://doi.org/10.1145/3676536.3676750

[23]

Jiang, Y., Yan, X., Ji, G.-P., Fu, K., Sun, M., Xiong, H., Fan, D.-P., Khan, F.S.: Effectiveness assessment of recent large vision-language models. Visual Intelligence 2(1), 17 (2024) https://doi.org/10.1007/s44267-024-00050-1

[24]

Bukhary, N., Ahmad, M., Rashad, K., Rai, S., Shapsough, S., Kaddoura, Y., Dghaym, D., Zualkernan, I.: Few-shot evaluation of vision language models for detecting visual defects in autonomous vehicle software requirement specifications. IEEE Access 13, 117914–117942 (2025) https://doi.org/10.1109/ACCESS.2025.3586554

[25]

Wang, Q., Wang, D., Lu, J., Xiao, G., Liang, D., Lu, G., Shao, H.: Sal-yolo-deepseek: a lightweight real-time detection and llm-driven decision framework for intelligent escalator safety monitoring. Scientific Reports 15(1), 40600 (2025) https://doi.org/10.1038/s41598-025-24260-9

[26]

Zhao, Y., Ma, T., Wang, Z., Zhang, Z., Li, C., Liu, S., Cui, Z., Lv, M., Yu, H., Peng, Z.: A multiview-integrated framework for traffic scene understanding based on yolo and llm. Journal of Advanced Transportation 2026(1) (2026) https://doi.org/10.1155/atr/2814128

[27]

Chen, Q., Yin, X.: Tailored vision-language framework for automated hazard identification and report generation in construction sites. Advanced Engineering Informatics 66, 103478 (2025) https://doi.org/10.1016/j.aei.2025.103478

[28]

Wang, Z., Fan, Z., Tan, S., Zhong, Y., Yuan, Y., Li, H., Jiang, H., Zhang, W., Shao, F., Wang, H., Xiao, J.: Mau-gpt: Enhancing multi-type industrial anomaly understanding via anomaly-aware and generalist experts adaptation. Proceedings of the AAAI Conference on Artificial Intelligence 40(31), 26787–26795 (2026) https://doi.org/10.1609/aaai.v40i31.39889

[29]

Zheng, Y., Chen, Y., Qian, B., Shi, X., Shu, Y., Chen, J.: A review on edge large language models: Design, execution, and applications. ACM Computing Surveys 57(8), 1–35 (2025) https://doi.org/10.1145/3719664

[30]

Gao, L., Ran, T., Zou, H., Wu, H.: Cotton leaf disease detection using llm-synthetic data and demm-yolo model. Agriculture 15(15), 1712 (2025) https://doi.org/10.3390/agriculture15151712

[31]

Zhao, J.: Cognitive-yolo: Llm-driven architecture synthesis from first principles of data for object detection (2025) https://doi.org/10.48550/arXiv.2512.12281

[32]

Nagrani, S., Narwane, V.S.: An exploration of factors influencing the adoption of digital twin technology in predictive maintenance. Journal of Quality in Maintenance Engineering 32(1), 269–290 (2026) https://doi.org/10.1108/JQME-05-2025-0055

[33]

Leon-Medina, J.X., Tibaduiza, D.A., Parés, N., Pozo, F.: Digital twin technology in wind turbine components: A review. Intelligent Systems with Applications 26, 200535 (2025) https://doi.org/10.1016/j.iswa.2025.200535

[34]

Chen, C., Fu, H., Zheng, Y., Tao, F., Liu, Y.: The advance of digital twin for predictive maintenance: The role and function of machine learning. Journal of Manufacturing Systems 71, 581–594 (2023) https://doi.org/10.1016/j.jmsy.2023.10.010

[35]

Abd Wahab, N.H., Hasikin, K., Lai, K.W., Xia, K., Bei, L., Huang, K., Wu, X.: Systematic review of predictive maintenance and digital twin technologies challenges, opportunities, and best practices. PeerJ Computer Science 10, 1943 (2024) https://doi.org/10.7717/peerj-cs.1943

[36]

Chen, Z., Fu, H., Zeng, Z.: A domain adaptation neural network for digital twin-supported fault diagnosis. In: 2025 International Conference on Control, Automation and Diagnosis (ICCAD), pp. 1–6. IEEE (2025). https://doi.org/10.1109/ICCAD64771.2025.11099349

[37]

Hnaien, I.B., Gascard, E., Simeu-Abazi, Z., Dhouibi, H., Duong, Q.B.: Unsupervised anomaly detection in robotic systems via high-fidelity digital twins and deep autoencoders. International Journal of Intelligent Robotics and Applications (2025) https://doi.org/10.1007/s41315-025-00509-4

[38]

Mikołajewska, E., Mikołajewski, D., Mikołajczyk, T., Paczkowski, T.: Generative ai in ai-based digital twins for fault diagnosis for predictive maintenance in industry 4.0/5.0. Applied Sciences 15(6), 3166 (2025) https://doi.org/10.3390/app15063166

[39]

Gomaa, A.: Digital twins for improving proactive maintenance management. Engineering Science 9(3), 60–70 (2024) https://doi.org/10.11648/j.es.20240903.12

[40]

Shihavuddin, A., Chen, X.: DTU – Drone inspection images of wind turbine. Mendeley Data. Version 2. https://doi.org/10.17632/hd96prn3nc.2 (2018)

[41]

Gohar, I.: DTU-annotations: Annotations for the DTU Wind Turbine Images Dataset. GitHub (2023)

[42]

Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label Studio: Data labeling software. Open source software available from https://github.com/HumanSignal/label-studio (2020–2025)

[43]

Wang, T., Zhang, B., Jiang, D., Li, D.: A multimodal large language model framework for intelligent perception and decision-making in smart manufacturing. Sensors 25(10), 3072 (2025) https://doi.org/10.3390/s25103072

[44]

Yu, Y., Zutty, J.: Llm-guided evolution: An autonomous model optimization for object detection. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 2363–2370. ACM (2025). https://doi.org/10.1145/3712255.3734340