Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

Disha Patel

arxiv: 2606.24173 · v1 · pith:BSZFXW6Rnew · submitted 2026-06-23 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 00:36 UTC · model grok-4.3

keywords fault detection · lightweight transformers · on-device deployment · benchmark · predictive maintenance · adaptive inference · quantization · class imbalance

The pith

Lightweight transformers match traditional ML at 87.8% F1 on turbofan fault data but at 100x model size and 9000x latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks traditional ML methods against lightweight transformer architectures for binary fault detection on three public datasets, focusing on accuracy, model size, and CPU latency for resource-constrained devices. On the C-MAPSS turbofan degradation dataset, transformers reach performance equivalent to Random Forest and XGBoost but at far higher cost in size and latency. An adaptive two-stage pipeline that quantizes a small triage model and routes only difficult cases to a larger expert achieves 87.6% F1 at 19.5 ms average latency. All approaches struggle on the severely imbalanced SECOM and UCI AI4I datasets, exposing limits in handling extreme class imbalance.

Core claim

On the C-MAPSS dataset, lightweight transformers match traditional ML at 87.8% F1 but at 100x the model size and 9000x the latency. TinyBERT-4L is the most deployment-friendly transformer at 55 MB and 18 ms CPU latency. INT8 quantization reduces size by 25% while preserving 86.9% F1. The adaptive pipeline, routing 97.9% of predictions through a quantized triage model and only 2.1% to a larger expert, achieves 87.6% F1 at 19.5 ms average latency. On severely imbalanced datasets both traditional and transformer methods perform poorly.

What carries the argument

The two-stage adaptive inference pipeline that routes most cases through a small quantized triage model and reserves a larger expert model for the remaining cases.
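A minimal sketch of how such two-stage routing could work, assuming a confidence-threshold criterion. The summary above does not state the paper's actual routing rule, so the `route` function, the `threshold` parameter, and the stub models below are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical model interface: maps a feature vector to P(fault) in [0, 1].
Model = Callable[[Sequence[float]], float]


def route(
    x: Sequence[float],
    triage: Model,
    expert: Model,
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> Tuple[int, str]:
    """Return (predicted label, name of the stage that made the decision)."""
    p_fault = triage(x)
    # Confidence of the triage model in whichever class it favors.
    confidence = max(p_fault, 1.0 - p_fault)
    if confidence >= threshold:
        return int(p_fault >= 0.5), "triage"
    # Low-confidence case: pay the extra cost of the larger expert model.
    return int(expert(x) >= 0.5), "expert"


# Stub models for illustration only; real models would wrap the
# quantized triage network and the larger expert network.
def toy_triage(x: Sequence[float]) -> float:
    return x[0]


def toy_expert(x: Sequence[float]) -> float:
    return 1.0


for features in ([0.97], [0.02], [0.60]):
    print(features, route(features, toy_triage, toy_expert))
```

Under this assumed rule, raising `threshold` sends more samples to the expert and trades latency for accuracy; the reported 2.1% routing rate would correspond to one particular operating point.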

If this is right

INT8 dynamic quantization cuts model size by 25% while retaining 86.9% F1 on C-MAPSS.
TinyBERT-4L delivers the best size-latency balance among the tested transformers at 55 MB and 18 ms.
The adaptive pipeline keeps F1 at 87.6%, close to the 87.8% single-model result, at an average latency of 19.5 ms.
Neither traditional ML nor lightweight transformers overcome extreme class imbalance on SECOM and UCI AI4I (UCI-PM).
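The average-latency figure for the pipeline can be read as simple expected-value arithmetic. The sketch below assumes a sequential design in which every sample pays the triage cost and a fraction additionally pays the expert cost. The 2.1% routing rate and 19.5 ms average are the reported values; the 18 ms triage latency is an assumption borrowed from the TinyBERT-4L figure purely for illustration, since the summary does not report the quantized triage model's latency or name the expert model. The function names are hypothetical.

```python
def average_latency_ms(t_triage_ms: float, t_expert_ms: float, p_expert: float) -> float:
    """Expected per-sample latency when every sample runs the triage model
    and a fraction p_expert additionally runs the expert model."""
    return t_triage_ms + p_expert * t_expert_ms


def implied_expert_latency_ms(avg_ms: float, t_triage_ms: float, p_expert: float) -> float:
    """Invert the formula above to get the expert latency implied by an average."""
    return (avg_ms - t_triage_ms) / p_expert


P_EXPERT = 0.021    # reported: 2.1% of predictions reach the expert
AVG_MS = 19.5       # reported: average pipeline latency
T_TRIAGE_MS = 18.0  # ASSUMPTION: triage latency, not stated in the summary

t_expert = implied_expert_latency_ms(AVG_MS, T_TRIAGE_MS, P_EXPERT)
print(f"implied expert latency: {t_expert:.1f} ms")
print(f"round-trip check: {average_latency_ms(T_TRIAGE_MS, t_expert, P_EXPERT):.1f} ms")
```

If the assumed triage latency is wrong, the implied expert latency changes accordingly; the point is only that a small routing fraction keeps the average close to the triage cost.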

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Specialized techniques for class imbalance may be required before these models can be deployed on real manufacturing lines.
Hybrid routing strategies could generalize to other sensor-based monitoring tasks where most samples are routine.
Repeating the benchmark on mobile GPUs or dedicated edge accelerators would clarify whether CPU latency numbers translate to production hardware.
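To illustrate the class-imbalance point in the first extension above, the sketch below shows why F1 rather than accuracy is the informative metric when faults are rare, and computes inverse-frequency class weights as one common mitigation. The label counts are hypothetical and are not drawn from SECOM or UCI AI4I; the weighting formula, n / (k * count), follows the widely used "balanced" heuristic and is not claimed to be what the paper used.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the positive (fault) class; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_class_weights(y: Sequence[int]) -> Dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical labels: 970 normal samples and 30 faults.
y_true = [0] * 970 + [1] * 30
# A degenerate classifier that never predicts a fault.
always_normal = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, always_normal)) / len(y_true)
f1 = f1_score(y_true, always_normal)
weights = balanced_class_weights(y_true)

print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")
print({label: round(w, 3) for label, w in weights.items()})
```

The degenerate classifier scores high accuracy while detecting no faults at all, which is why a benchmark on imbalanced data is better judged by F1 or AUC.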

Load-bearing premise

That the three public datasets adequately represent the distribution and challenges of real-world fault detection on resource-constrained devices, and that the evaluation metrics were measured under consistent and fair conditions across all models.

What would settle it

Running the same models on a fresh dataset collected from operating industrial equipment that exhibits known class imbalance and measuring end-to-end latency on actual edge hardware would show whether the reported accuracy-latency tradeoffs hold.

Figures

Figures reproduced from arXiv: 2606.24173 by Disha Patel.

Figure 3: Accuracy vs. CPU inference latency on C-MAPSS.

Figure 2: F1-Score heatmap across all model configurations and

Original abstract

On-device fault detection enables real-time diagnostics without cloud dependency, but deploying machine learning models on resource-constrained hardware demands careful tradeoffs between accuracy, latency, and model size. We present a benchmark comparing traditional ML methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer architectures (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) for binary fault detection across three public datasets: NASA C-MAPSS turbofan degradation, SECOM semiconductor manufacturing, and UCI AI4I 2020 predictive maintenance. We evaluate classification performance (F1-score, AUC), model size, and CPU inference latency, and further assess INT8 dynamic quantization and a two-stage adaptive inference pipeline. Our results reveal that on well-separated sensor data (C-MAPSS), lightweight transformers match traditional ML at 87.8% F1 but at 100x the model size and 9000x the latency. TinyBERT-4L emerges as the most deployment-friendly transformer at 55 MB and 18 ms CPU latency. INT8 quantization reduces size by 25% while preserving 86.9% F1. Our adaptive pipeline, routing 97.9% of predictions through a quantized triage model and only 2.1% to a larger expert, achieves 87.6% F1 at 19.5 ms average latency. On severely imbalanced datasets (SECOM, UCI-PM), both traditional and transformer methods struggle significantly, highlighting fundamental limitations of current approaches for extreme class imbalance in fault detection. All code is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Lightweight transformers offer no clear win over traditional ML for these fault detection tasks, but the benchmark and adaptive pipeline provide some practical numbers with public code.

The one thing to know: this paper benchmarks lightweight transformers against classic ML models on fault detection and finds that the transformers match performance on one dataset, but at much higher cost in size and speed, with an adaptive setup that helps on latency.

Its strength is clear quantitative results across F1, AUC, model size, and latency for DistilBERT, the TinyBERT variants, and MobileBERT against Random Forest and the other traditional models on C-MAPSS, SECOM, and UCI AI4I. The adaptive pipeline that routes most predictions to a small model is a sensible engineering choice. The author also tests quantization and notes the struggles with imbalanced classes, which shows some care. Public code is available, so others can verify.

The weaker parts are the reliance on public datasets without arguing that they represent real deployments, and the lack of detail on how the transformer architectures were adapted to non-text sensor data. CPU latency may not match on-device conditions, and no error bars or significance tests are mentioned. Novelty is low, since the paper applies known techniques.

This is for people building on-device systems in industrial maintenance who need data on these tradeoffs. A reader interested in model efficiency for edge AI would get some value from the numbers.

I think it should go to peer review for the empirical contribution and transparency, even if revisions would be needed for more rigorous stats or hardware tests.

Referee Report

0 major / 3 minor

Summary. The manuscript benchmarks traditional ML methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer architectures (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) for binary fault detection on three public datasets (NASA C-MAPSS, SECOM, UCI AI4I 2020). It reports F1/AUC performance, model size, and CPU latency, including results for INT8 quantization and a two-stage adaptive inference pipeline that routes most samples through a triage model. Key results include matching 87.8% F1 on C-MAPSS with large efficiency gaps, 87.6% F1 at 19.5 ms via the adaptive pipeline, and poor performance on imbalanced datasets; all code is released publicly.

Significance. If the benchmark results hold under consistent conditions, the work supplies practical guidance on accuracy-efficiency tradeoffs for on-device fault detection, showing that traditional ML can be preferable to transformers for latency/size constraints while quantization and adaptive routing offer mitigation paths. Public code and explicit caveats on imbalance are strengths that support reproducibility and honest assessment.

minor comments (3)

Abstract: the 9000x latency and 100x size claims should explicitly name the reference traditional ML model and the exact measurement conditions (e.g., batch size, hardware) to allow direct verification.
The evaluation would benefit from reporting standard deviations or multiple random seeds for the F1/AUC numbers, especially given the acknowledged sensitivity to imbalance.
Section describing the adaptive pipeline: clarify the exact routing threshold or confidence criterion used to decide when to invoke the expert model, as this directly affects the reported 2.1% routing rate.
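The first two comments ask for explicit measurement conditions and for variability across runs. A minimal sketch of what such reporting could look like is given below, assuming single-sample (batch size 1) CPU timing with warm-up iterations. The function names, the warm-up and run counts, and the toy predictor are hypothetical and do not describe the paper's actual harness.

```python
import statistics
import time
from typing import Callable, Dict, Sequence, Tuple


def benchmark_latency(
    predict: Callable[[Sequence[float]], object],
    sample: Sequence[float],
    warmup: int = 10,   # assumed: untimed calls to stabilize caches
    runs: int = 100,    # assumed: number of timed calls
) -> Dict[str, float]:
    """Time single-sample inference and report mean, median, and stdev in ms."""
    if runs < 2:
        raise ValueError("runs must be >= 2 to compute a standard deviation")
    for _ in range(warmup):
        predict(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "stdev_ms": statistics.stdev(timings_ms),
    }


def summarize_seeds(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy predictor standing in for a real model's inference call.
def toy_predict(x: Sequence[float]) -> bool:
    return sum(v * v for v in x) > 1.0


stats = benchmark_latency(toy_predict, [0.1] * 32)
print({name: round(value, 4) for name, value in stats.items()})
```

Reporting the hardware, batch size, and these summary statistics alongside each latency number would make ratios such as 9000x directly checkable.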

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions, and the recommendation for minor revision. We appreciate the recognition of the practical guidance on accuracy-efficiency tradeoffs, the value of public code, and the honest caveats regarding class imbalance.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical benchmark study reporting direct experimental measurements of F1, AUC, model size, and latency on public datasets using standard models and pipelines. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described claims. All reported numbers (e.g., 87.8% F1, 19.5 ms latency) are presented as outcomes of benchmark runs rather than reductions to prior inputs by construction. The central claims rest on reproducible experimental results with public code noted, satisfying the criteria for a self-contained non-circular evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical benchmark paper, the central claims rest on experimental results rather than axioms or invented entities. No free parameters are explicitly fitted in the abstract description.
