A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

Alasdair McDonald; Andre Biscaya; Jonathan Shek; Max Malyi

arXiv: 2509.06813 · v1 · submitted 2025-09-08 · 💻 cs.CL


Pith reviewed 2026-05-18 18:00 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models, wind turbine maintenance, log classification, benchmarking, human-in-the-loop, operational data, O&M analysis, semantic ambiguity


The pith

Large language models perform well on clear parts of wind turbine maintenance logs but need human oversight for reliable results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up an open benchmark to test how well current LLMs can turn free-text wind turbine maintenance records into consistent labels. It measures model alignment to a standard set of categories, checks how well their confidence scores match actual correctness, and shows that every model struggles more when the task involves interpreting actions rather than naming parts. Because no model reaches perfect accuracy and calibration differs sharply across models, the work concludes that the practical next step is a human-in-the-loop workflow in which LLMs speed up and standardise labelling for expert review. This matters for wind energy because better labelled logs directly support reliability analysis and lower operating costs.

Core claim

A systematic comparison of state-of-the-art LLMs on real wind-turbine maintenance logs reveals a clear performance ordering, with the strongest models achieving high agreement on objective component identification yet lower consensus on interpretive maintenance actions; no model reaches full accuracy and calibration quality varies widely, so the authors present human-in-the-loop assistance as the responsible near-term use.

What carries the argument

The open-source benchmarking framework that scores LLMs on alignment to a ground-truth category set and on confidence calibration for unstructured industrial logs.
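As an illustration of the alignment score, here is a minimal sketch of a support-weighted F1 computed against a reference label set, the metric named in the Figure 2 caption. The function name `weighted_f1` and the component labels are hypothetical; this is not the paper's released code, only one plain-Python way to compute the same quantity.

```python
from collections import Counter


def weighted_f1(reference, predicted):
    """Support-weighted F1 of `predicted` labels against `reference` labels.

    Each class's F1 is weighted by how often that class occurs in the
    reference, so frequent categories dominate the score.
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")

    support = Counter(reference)
    total = 0.0
    for label, n_ref in support.items():
        # True positives: positions where both agree on this label.
        tp = sum(1 for r, p in zip(reference, predicted) if r == label and p == label)
        n_pred = sum(1 for p in predicted if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_ref
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_ref
    return total / len(reference)


# Hypothetical component labels, for illustration only.
reference = ["gearbox", "gearbox", "generator", "blade"]
predicted = ["gearbox", "generator", "generator", "blade"]
print(round(weighted_f1(reference, predicted), 3))  # 0.75
```

Labels that appear only in the predictions have zero support and so contribute nothing directly; they still lower the precision of no class but reduce recall for the class they displaced.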

If this is right

LLMs can accelerate and standardise the labelling of maintenance logs for human experts.
Improved label quality supports better downstream reliability and failure analysis.
The public framework allows repeated testing as new models appear.
Performance remains higher on objective component tasks than on interpretive action tasks across all tested models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same benchmark structure could be reused for maintenance logs from other industrial sectors that produce similar free-text records.
Prompt engineering or retrieval-augmented techniques might narrow the gap on ambiguous action labels without changing the overall human-in-the-loop recommendation.
If calibration improves in future models, the volume of human review needed per log could be reduced while still preserving data quality.
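The link between calibration and review volume can be sketched as a simple triage step. The example below assumes each model output carries a self-reported confidence level on a three-step scale ("high", "medium", "low"); that scale, the `LabelledLog` type, and the `triage` function are all illustrative assumptions, not part of the paper's framework.

```python
from dataclasses import dataclass


@dataclass
class LabelledLog:
    text: str
    label: str
    confidence: str  # assumed self-reported scale: "high", "medium" or "low"


def triage(logs, fast_track_levels=("high",)):
    """Split model output into a fast-track queue and a full-review queue.

    Fast-tracked logs would still be seen by an expert (for example via
    spot checks); the split only decides how much attention each log gets.
    """
    fast_track, full_review = [], []
    for log in logs:
        if log.confidence in fast_track_levels:
            fast_track.append(log)
        else:
            full_review.append(log)
    return fast_track, full_review


# Hypothetical log entries, for illustration only.
logs = [
    LabelledLog("Replaced pitch motor on blade 2", "pitch system", "high"),
    LabelledLog("Checked noise, reset, monitor", "unknown", "low"),
    LabelledLog("Gearbox oil top-up", "gearbox", "medium"),
]
fast_track, full_review = triage(logs)
print(len(fast_track), len(full_review))  # 1 2
```

Widening `fast_track_levels` only makes sense for a model whose confidence has been shown to track correctness; for a poorly calibrated model the split gives no real saving.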

Load-bearing premise

The chosen benchmark categories and sample logs accurately reflect the true meaning and distribution of real operational maintenance records.

What would settle it

Independent expert re-labelling of the same logs followed by comparison to the benchmark standard to check whether the reported performance hierarchy and semantic-ambiguity gap still hold.
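A check of this kind could start from a chance-corrected agreement score between the expert labels and the benchmark labels. The sketch below implements Cohen's kappa in plain Python; the function name and the action labels are hypothetical, and returning 1.0 in the degenerate single-label case is a convention chosen here, not something taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout (convention).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical maintenance-action labels, for illustration only.
expert = ["repair", "replace", "inspect", "repair", "replace", "inspect"]
benchmark = ["repair", "replace", "inspect", "replace", "replace", "inspect"]
print(round(cohens_kappa(expert, benchmark), 3))  # 0.75
```

Re-running the model comparison with the expert labels as the reference would then show whether the ranking and the component-versus-action gap survive the change of standard.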

Figures

Figures reproduced from arXiv: 2509.06813 by Alasdair McDonald, Andre Biscaya, Jonathan Shek, Max Malyi.

**Figure 1.** Throughput (logs/s) vs. Estimated Cost ($) for API and local models. The left panel shows processing speed (higher is better), with local models in a lighter shade. The right panel shows the financial cost for API models (lower is better).

**Figure 2.** Weighted F1-Scores for each model’s alignment with the gpt-5 benchmark, sorted by average performance across both tasks.

**Figure 3.** Confusion matrix for the Issue Category labels generated by gpt-o3 against the gpt-5 reference. The clear diagonal pattern indicates high agreement.

**Figure 6 (note on reading the figure; truncated in source).** A Kappa score above 0.81 indicates high agreement, while scores between 0.61 and …

**Figure 4.** Distribution of self-reported confidence levels across all models. Accompanying text from the paper's Discussion (Section 4): "A key takeaway from this study is that modern LLMs are exceptionally well-suited for this industrial classification task, representing a significant leap over both traditional natural language processing methods and purely manual processing."

**Figure 5.** Model Calibration: Average F1-Score vs. Self-Reported Confidence Level. gpt-5 is shown as the benchmark (dimmed bars).

**Figure 6.** Average Pairwise Inter-Model Agreement (Cohen’s Kappa). Warmer colours indicate higher agreement between models. Accompanying text from the paper: "The results of this benchmark confirm that while LLMs are powerful tools for industrial data labelling, they are not a monolithic, plug-and-play solution. The observed trade-off between efficiency and reliability highlights that the optimal model choice is application-specific."
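The calibration view in Figure 5 groups a model's outputs by self-reported confidence and scores each group against the reference. A minimal sketch of that grouping follows. It swaps in a plain agreement rate per confidence level in place of the F1-score used in the figure, and the record layout, level names, and function name are assumptions made for illustration.

```python
from collections import defaultdict


def agreement_by_confidence(records):
    """Share of predictions matching the reference, per confidence level.

    `records` is an iterable of (confidence_level, predicted, reference)
    tuples. A well-calibrated model should score highest at its highest
    confidence level and lowest at its lowest.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for confidence, predicted, reference in records:
        counts[confidence] += 1
        hits[confidence] += int(predicted == reference)
    return {level: hits[level] / counts[level] for level in counts}


# Hypothetical records, for illustration only.
records = [
    ("high", "gearbox", "gearbox"),
    ("high", "blade", "blade"),
    ("medium", "generator", "gearbox"),
    ("medium", "blade", "blade"),
    ("low", "generator", "blade"),
]
print(agreement_by_confidence(records))
# {'high': 1.0, 'medium': 0.5, 'low': 0.0}
```

With real data each level needs enough samples for the rate to be meaningful; a level containing only a handful of logs says little about calibration.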

Original abstract

Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task's semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a practical open benchmark for LLMs on wind turbine logs but leaves the reference standard's construction under-specified.


Here's the quick read on this one. The paper introduces a new open-source benchmarking framework for using LLMs to classify wind turbine maintenance logs and evaluates several models on it, concluding that human-in-the-loop is the practical path forward.

What stands out is the application to real industrial data and the release of the tool. This makes it easier for others to test models on similar tasks without building everything themselves. They report a performance hierarchy, note that models align better on identifying components than on interpreting maintenance actions, and highlight differences in calibration. The human-in-the-loop suggestion follows logically from the imperfect results. The paper does a decent job of addressing a concrete barrier in wind energy operations and maintenance.

The soft spots are around the foundation of the results. The performance claims and the semantic ambiguity findings rest on alignment with a benchmark standard, but there's little information on how that standard was constructed, including expert involvement or agreement levels. The selected logs' representativeness also isn't detailed. Without that, it's harder to be confident in the rankings. The abstract avoids giving numbers, which leaves the strength of the findings unclear until the full results are checked.

This kind of work is useful for people in renewable energy who need better ways to process maintenance records, or for those applying LLMs to domain-specific text classification. A practitioner or researcher in that area would find the comparisons and the tool valuable. It has enough substance and reproducibility to deserve a serious referee. I'd recommend putting it through peer review, with the main feedback likely focusing on bolstering the validation of the reference standard and including more quantitative details.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a novel reproducible framework for benchmarking LLMs on the classification of unstructured wind turbine maintenance logs. It evaluates a suite of state-of-the-art LLMs, identifies a performance hierarchy based on alignment with a benchmark standard, shows higher model consensus on component identification compared to maintenance actions due to semantic ambiguity, and concludes that a Human-in-the-Loop system is the most effective near-term application.

Significance. This work offers an open-source tool and a systematic evaluation of LLMs for a practical industrial application in wind energy O&M. If the benchmark is validated, it could standardize data labeling, improve reliability analysis, and contribute to reducing LCOE. The emphasis on calibration and human assistance provides responsible guidance for deploying LLMs in high-stakes domains.

major comments (1)

[Abstract and Results] The performance hierarchy, calibration findings, and semantic ambiguity results are all predicated on alignment with a 'benchmark standard.' However, the manuscript provides no information on how this standard was constructed, such as the number of human experts, inter-annotator agreement scores, or the process for resolving discrepancies. Without these details, the validity of the reported model rankings and the recommendation for Human-in-the-Loop systems cannot be fully assessed.

minor comments (1)

[Abstract] The abstract does not include key quantitative details such as the size of the dataset, specific performance metrics (e.g., accuracy, F1 scores), or error bars, which would help readers quickly gauge the scope and robustness of the evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important gap in the transparency of our benchmark construction. We address the major comment below and will revise the manuscript accordingly to strengthen the validity of our results and recommendations.


Referee: [Abstract and Results] The performance hierarchy, calibration findings, and semantic ambiguity results are all predicated on alignment with a 'benchmark standard.' However, the manuscript provides no information on how this standard was constructed, such as the number of human experts, inter-annotator agreement scores, or the process for resolving discrepancies. Without these details, the validity of the reported model rankings and the recommendation for Human-in-the-Loop systems cannot be fully assessed.

Authors: We agree that the current manuscript lacks sufficient detail on the construction of the benchmark standard, which is essential for readers to evaluate the reported performance hierarchy, calibration results, and the Human-in-the-Loop recommendation. In the revised manuscript, we will add a new subsection (likely in the Methods or Data section) that explicitly describes the annotation process. This will include the number of human experts involved, inter-annotator agreement metrics, and the procedure for resolving any discrepancies. These additions will directly address the referee's concern and allow for a more rigorous assessment of our findings without altering the core results.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark against external reference standard


The paper is an empirical benchmark study that directly evaluates LLMs against a pre-existing reference standard for labeling maintenance logs. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-referential quantities appear in the abstract or described framework. Performance hierarchy, calibration scores, and semantic-ambiguity comparisons are computed as straightforward alignment metrics with the benchmark standard treated as an independent input. The study contains no self-citation load-bearing uniqueness theorems, ansatz smuggling, or renaming of known results. It is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on standard ML benchmarking assumptions about data representativeness and label quality rather than new free parameters or invented entities.

axioms (1)

Domain assumption: Maintenance logs contain distinguishable objective and interpretive semantic content that can be classified by prompted LLMs.
Invoked when discussing higher consensus on component identification versus maintenance actions.

pith-pipeline@v0.9.0 · 5766 in / 1138 out tokens · 50709 ms · 2026-05-18T18:00:55.584555+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
cs.AI 2026-05 unverdicted novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · cited by 1 Pith paper

[1] M. Hodkiewicz and M. T.-W. Ho, “Cleaning historical maintenance work order data for reliability analysis,” Journal of Quality in Maintenance Engineering, vol. 22, no. 2, pp. 146–163, May 2016.

[2] B. Hahn, T. Welte, S. Faulstich, P. Bangalore, C. Boussion, K. Harrison, E. Miguelanez-Martin, F. O’Connor, L. Pettersson, C. Soraghan, C. Stock-Williams, J. Dalsgaard Sørensen, G. Van Bussel, and J. Vatn, “Recommended practices for wind farm data collection and reliability assessment for O&M optimization,” Energy Procedia, vol. 137, pp. 358–365, 2017.

[3] E. Salo, D. McMillan, and R. Connor, “Work Orders - Value from Structureless Text in the Era of Digitisation,” in SPE Offshore Europe Conference and Exhibition. Aberdeen, UK: SPE, Sep. 2019.

[4] M.-A. Lutz, J. Walgern, K. Beckh, J. Schneider, S. Faulstich, and S. Pfaffel, “Digitalization Workflow for Automated Structuring and Standardization of Maintenance Information of Wind Turbines into Domain Standard as a Basis for Reliability KPI Calculation,” Journal of Physics: Conference Series, vol. 2257, no. 1, p. 012004, Apr. 2022.

[5] E. Salo, “Analysis of SAP work order data by turbine technology type for onshore wind,” Master’s thesis, University of Strathclyde, Glasgow, UK, 2017.

[6] M.-A. Lutz, B. Schäfermeier, R. Sexton, M. Sharp, A. Dima, S. Faulstich, and J. M. Aluri, “KPI Extraction from Maintenance Work Orders—A Comparison of Expert Labeling, Text Classification and AI-Assisted Tagging for Computing Failure Rates of Wind Turbines,” Energies, vol. 16, no. 24, p. 7937, Dec. 2023.

[7] J. Walgern, K. Beckh, N. Hannes, M. Horn, M.-A. Lutz, K. Fischer, and A. Kolios, “Impact of using text classifiers for standardising maintenance data of wind turbines on reliability calculations,” IET Renewable Power Generation, vol. 18, no. 15, pp. 3463–3479, Nov. 2024.

[8] T. Walshe, S. Y. Moon, C. Xiao, Y. Gunawardana, and F. Silavong, “Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration,” arXiv, no. arXiv:2501.12332, Jan. 2025.

[9] C. Walker, C. Rothon, K. Aslansefat, Y. Papadopoulos, and N. Dethlefs, “SafeLLM: Domain-Specific Safety Monitoring for Large Language Models: A Case Study of Offshore Wind Maintenance,” arXiv, no. arXiv:2410.10852, Oct. 2024.

[10] R. Cox, “Enrichment of Wind Turbine Health History for Condition-Based Maintenance,” Ph.D. dissertation, Durham University, Durham, UK, 2022.
