A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
Pith reviewed 2026-06-29 18:47 UTC · model grok-4.3
The pith
A three-component pipeline with a 1.5B adapted model generates higher-quality defect reports than a 671B generalist vision-language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The decoupled Eyes-Bridge-Brain pipeline, with a 4-bit quantized Qwen-2.5-1.5B model adapted via QLoRA on 947 synthetic reports and RAFT for procedure grounding, produces structured JSON reports that achieve a BLEU-4 of 0.41, a hallucination rate of 4 percent, and an expert score of 8.6/10, exceeding both the zero-shot VLM baseline and a 671B generalist model given the same detection input.
What carries the argument
The three-part pipeline: YOLO26-x-obb detector for oriented bounding boxes, deterministic Bridge module that encodes boxes into grid-referenced spatial tokens, and QLoRA-adapted 1.5B LLM that converts the prompt into a structured JSON report.
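As an illustration of what a deterministic, parameter-free Bridge could look like, the sketch below maps each oriented box to a grid cell and renders it as a text token inside a prompt. The paper's actual token format is not given here, so the 4x4 grid, the `<defect ...>` token layout, the prompt wording, and the names `OrientedBox`, `grid_cell`, `encode_box`, and `build_prompt` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrientedBox:
    label: str         # defect class name (hypothetical labels in the usage note)
    cx: float          # box centre x in pixels
    cy: float          # box centre y in pixels
    angle_deg: float   # box rotation in degrees
    confidence: float  # detector confidence in [0, 1]


def grid_cell(cx, cy, image_w, image_h, rows=4, cols=4):
    """Map a pixel centre to a cell name such as 'B3' (row letter, column number)."""
    # Clamp so that a centre lying exactly on the right or bottom edge stays in range.
    col = min(int(cx / image_w * cols), cols - 1)
    row = min(int(cy / image_h * rows), rows - 1)
    return f"{chr(ord('A') + row)}{col + 1}"


def encode_box(box, image_w, image_h):
    """Render one detection as a spatial token (assumed format, not the paper's)."""
    cell = grid_cell(box.cx, box.cy, image_w, image_h)
    return (f"<defect class={box.label} cell={cell} "
            f"angle={box.angle_deg:.0f} conf={box.confidence:.2f}>")


def build_prompt(boxes, image_w, image_h):
    """Build the prompt; sorting makes the output independent of detection order."""
    ordered = sorted(boxes, key=lambda b: (-b.confidence, b.label))
    lines = [encode_box(b, image_w, image_h) for b in ordered]
    return "Detections:\n" + "\n".join(lines) + "\nWrite a JSON maintenance report."
```

Because the encoding has no learned parameters, the same detections always yield the same prompt. For example, `encode_box(OrientedBox("crack", 100, 700, 30.0, 0.913), 800, 800)` places the box in cell `D1` of the assumed grid.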
If this is right
- The complete pipeline runs at 47 tokens per second on a single T4-class GPU, enabling edge deployment.
- Ablation results show that removing any one component increases hallucination rate and lowers expert scores.
- The 1.5B QLoRA model produces higher-quality reports than the 671B generalist model when both receive identical detection evidence.
- Retrieval-augmented fine-tuning grounds recommendations in indexed maintenance procedures.
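The summary does not say how hallucination rate is computed. One plausible reading, sketched below purely as an assumption, counts a report as hallucinated when it cannot be parsed as JSON or when it mentions a defect class and grid cell pair that is absent from the detection evidence. The JSON field names `findings`, `class`, and `cell` are hypothetical.

```python
import json


def is_hallucinated(report_json, evidence):
    """Return True if the report is unparseable or cites a finding not in the evidence.

    `evidence` is a set of (defect_class, grid_cell) pairs taken from the detector.
    """
    try:
        report = json.loads(report_json)
    except json.JSONDecodeError:
        return True
    if not isinstance(report, dict):
        return True
    findings = report.get("findings", [])
    return any((f.get("class"), f.get("cell")) not in evidence for f in findings)


def hallucination_rate(reports, evidences):
    """Fraction of reports flagged as hallucinated (0.0 for an empty input)."""
    flags = [is_hallucinated(r, e) for r, e in zip(reports, evidences)]
    return sum(flags) / len(flags) if flags else 0.0
```

Under this assumed definition, an ablation that removes the Bridge would be expected to raise the rate whenever the model invents locations that the detector never reported.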
Where Pith is reading between the lines
- The deterministic Bridge encoding of spatial tokens could be added to other vision-language models to reduce spatial hallucinations without retraining the full model.
- If synthetic report generation can be scaled to new industrial domains, the same small-model adaptation approach may apply beyond wind turbines.
- The performance gap between the adapted 1.5B model and the 671B baseline suggests that task-specific structure and domain data matter more than raw parameter count for structured output tasks.
Load-bearing premise
The 947 synthetically generated maintenance reports represent real-world scenarios and the LLM-as-a-Judge scores align with actual expert judgment.
What would settle it
Run the pipeline on a held-out set of real expert-written maintenance reports from actual wind turbine inspections and measure whether BLEU-4, hallucination rate, and expert scores remain higher than the large VLM baseline.
Original abstract
Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 = 0.41, HR = 4%, and Expert Score = 8.6/10, compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
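For readers unfamiliar with the BLEU-4 figure quoted in the abstract, the following is a minimal single-reference, sentence-level sketch of the standard formula: the geometric mean of clipped 1- to 4-gram precisions multiplied by a brevity penalty. It uses whitespace tokenization and no smoothing, both of which are simplifications; the paper's own tooling and whether it scores at corpus level are not stated, so this is not necessarily how the reported 0.41 was obtained.

```python
from collections import Counter
import math


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = 0.0
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions += 0.25 * math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precisions)
```

Because the unsmoothed score drops to zero whenever any n-gram order has no match, short structured outputs such as JSON fields can score low even when they are semantically correct, which is one reason the metric is reported alongside ROUGE-L and judge scores.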
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a decoupled three-component pipeline for wind-turbine blade defect inspection and structured report generation: a YOLO26-x-obb detector (Eyes) for oriented bounding-box localization, a deterministic parameter-free Bridge module that encodes detections into grid-referenced spatial tokens, and a 4-bit QLoRA-adapted Qwen-2.5-1.5B model (Brain) fine-tuned on 947 synthetically generated maintenance reports plus RAFT, which produces JSON reports. Five ablations using BLEU-4, ROUGE-L, Hallucination Rate, and LLM-as-Judge scoring show the full pipeline reaching BLEU-4 0.41 / HR=4% / Expert Score 8.6/10 versus 0.07 / 65% / 3.3/10 for a zero-shot VLM baseline and outperforming a 671B generalist model at 47 tokens/s on a T4 GPU.
Significance. If the central claims hold after validation, the work would demonstrate that a purpose-built, edge-deployable decoupled architecture with modest domain-specific adaptation can exceed both zero-shot VLMs and much larger generalist models on structured industrial report generation. The quantitative ablation suite and explicit comparison to a 671B baseline provide concrete evidence for the value of task decomposition and small-model specialization in this domain.
major comments (2)
- [Abstract] Abstract (final two paragraphs): The reported superiority (BLEU-4 0.41, HR=4%, Expert Score 8.6/10) and the claim that the QLoRA-adapted 1.5B model outperforms the 671B generalist model rest entirely on adaptation and evaluation using 947 synthetically generated reports together with an unvalidated LLM-as-Judge rubric. No generation procedure, lexical/structural diversity statistics, or human-expert alignment study for these reports is supplied, which directly undermines the generalization argument that the pipeline performs better on the actual maintenance task.
- [Abstract] Abstract (ablation description): The five ablation experiments compare the complete pipeline against a monolithic VLM baseline and partial configurations, yet the evaluation remains confined to held-out synthetic reports. Without evidence that the synthetic corpus reproduces the lexical, recommendation, and variability distributions of real wind-turbine maintenance reports, the ablation results cannot establish that the decoupled design improves real-world defect reasoning.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the focus on the synthetic data foundation. We address each major comment below. Where the manuscript is incomplete we propose targeted additions; where new empirical validation would be required we note the limitation explicitly.
- Referee: [Abstract] Abstract (final two paragraphs): The reported superiority (BLEU-4 0.41, HR=4%, Expert Score 8.6/10) and the claim that the QLoRA-adapted 1.5B model outperforms the 671B generalist model rest entirely on adaptation and evaluation using 947 synthetically generated reports together with an unvalidated LLM-as-Judge rubric. No generation procedure, lexical/structural diversity statistics, or human-expert alignment study for these reports is supplied, which directly undermines the generalization argument that the pipeline performs better on the actual maintenance task.
Authors: We agree that the generation procedure and diversity statistics must be supplied. The 947 reports were produced by a deterministic template engine seeded with defect taxonomies and maintenance-action lists drawn from our industrial partner’s historical logs; each template was then varied by sampling from a small set of lexical paraphrases and recommendation phrasings. We will add a new subsection (3.3) that documents the template grammar, the sampling procedure, and quantitative diversity measures (type-token ratio, n-gram entropy, and structural variance across JSON fields). We also acknowledge that no separate human-expert alignment study was performed; this is a genuine limitation of the current study and will be stated as such in the revised Limitations paragraph. revision: partial
- Referee: [Abstract] Abstract (ablation description): The five ablation experiments compare the complete pipeline against a monolithic VLM baseline and partial configurations, yet the evaluation remains confined to held-out synthetic reports. Without evidence that the synthetic corpus reproduces the lexical, recommendation, and variability distributions of real wind-turbine maintenance reports, the ablation results cannot establish that the decoupled design improves real-world defect reasoning.
Authors: The ablations are performed on held-out synthetic reports by design, because the controlled corpus lets us isolate the contribution of each pipeline stage without confounding factors from real-world annotation noise. We will expand the Data section to include a side-by-side comparison of key statistics (average report length, frequency of each defect class, distribution of recommendation verbs) between the synthetic corpus and a small set of redacted real maintenance reports that our partner permitted us to inspect. We cannot, however, release or evaluate on a large public real-report corpus; therefore the claim that the architecture improves real-world performance rests on the assumption that the synthetic distribution is sufficiently representative—an assumption we will now qualify in the text. revision: partial
- A full human-expert alignment study comparing synthetic versus real reports cannot be supplied without additional data access and annotation resources that are outside the scope of the present work.
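The corpus statistics the authors promise in their responses (type-token ratio, n-gram entropy, and a synthetic-versus-real comparison of defect class frequencies) could be computed along the following lines. This is a sketch under stated assumptions: whitespace tokens, Shannon entropy in bits, and total variation distance as the comparison measure are choices made for this example, not details taken from the rebuttal.

```python
from collections import Counter
import math


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ngram_entropy(tokens, n=2):
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in Counter(grams).values())


def class_distribution(labels):
    """Relative frequency of each defect class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions, in [0, 1]."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A low type-token ratio or bigram entropy on the 947 reports would indicate that the template engine produces near-duplicate text, and a large total variation distance between synthetic and real class frequencies would weaken the representativeness assumption that both referee comments target.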
Circularity Check
No significant circularity; empirical evaluation on held-out data
The paper presents a decoupled pipeline evaluated via standard metrics (BLEU-4, ROUGE-L, HR, LLM-as-Judge) on held-out synthetic reports after QLoRA adaptation, with ablations against baselines. No load-bearing step reduces by construction to its inputs, no self-definitional mappings, no fitted parameters renamed as predictions, and no self-citation chains invoked as uniqueness theorems. The central claims rest on independent held-out comparisons rather than tautological reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic maintenance reports can train a model to produce accurate real reports.