YOTOnet: Zero-Shot Cross-Domain Fault Diagnosis via Domain-Conditioned Mixture of Experts
Pith reviewed 2026-05-08 16:28 UTC · model grok-4.3
The pith
YOTOnet enables a single training run to diagnose bearing faults across different datasets and conditions through invariant features and adaptive expert routing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YOTOnet shows that combining a physics-aware Invariant Feature Distiller with Domain-Conditioned Sparse Experts produces better generalization in cross-domain bearing fault diagnosis. The distiller uses multi-scale dilated convolutions and FFT-based fusion to create domain-agnostic features, while the experts route samples via learned gating without any external metadata; a dual-head setup adds auxiliary supervision. On five public datasets and thirty cross-dataset tests, this yields higher F1 scores than prior methods, with average test F1 rising from 0.5339 when trained on one dataset to 0.705 when trained on four.
What carries the argument
Domain-Conditioned Sparse Experts that adaptively route inputs to specialized processors via learned gating, paired with a physics-aware Invariant Feature Distiller that extracts domain-agnostic representations using multi-scale dilated convolutions and FFT-based time-frequency fusion.
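To make the distiller idea concrete, here is a minimal sketch of combining multi-scale dilated filter responses with a normalized FFT magnitude spectrum. The kernel, dilation rates, and feature layout are invented for illustration; the paper's actual architecture is not reproduced here.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-length 1-D convolution with a dilated kernel: dilation d
    reads taps d samples apart, widening the receptive field without
    adding parameters."""
    k = len(kernel)
    span = (k - 1) * dilation
    padded = np.pad(x, (span // 2, span - span // 2))
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * padded[i + j * dilation]
    return out

def invariant_features(signal, dilations=(1, 2, 4)):
    """Sketch of the distiller idea: concatenate multi-scale dilated
    difference-filter responses (time branch) with a normalized FFT
    magnitude spectrum (frequency branch)."""
    kernel = np.array([1.0, -1.0])          # illustrative fixed kernel
    time_branch = np.concatenate(
        [dilated_conv1d(signal, kernel, d) for d in dilations])
    spectrum = np.abs(np.fft.rfft(signal))
    freq_branch = spectrum / (spectrum.sum() + 1e-12)  # scale-invariant
    return np.concatenate([time_branch, freq_branch])
```

Normalizing the spectrum removes overall amplitude, one plausible way such a fusion can discard a domain-specific factor while keeping fault-related spectral shape.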
If this is right
- Average test F1 rises from 0.5339 with one training dataset to 0.705 with four, confirming a scaling benefit.
- The model supports train-once deployment for zero-shot use on new equipment or conditions.
- No external metadata is required for the expert routing to function across domains.
- The dual-head classification with auxiliary supervision improves the quality of the learned representations.
Where Pith is reading between the lines
- The same structure might transfer to fault diagnosis on other rotating machinery such as gears or pumps if the invariant features hold.
- Tracking which experts activate for different operating conditions could help identify the most informative sensor placements in new installations.
- Further gains may appear if the number of training datasets grows beyond four, following the observed trend from three to four datasets.
Load-bearing premise
The learned gating in the sparse experts routes inputs to the right specialized processors without needing domain labels, and the multi-scale convolutions with FFT fusion truly remove domain-specific information from the features.
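A minimal sketch of what such label-free routing looks like in general, using the standard top-k sparse-MoE pattern: the gate scores experts from the feature vector alone, keeps the top_k, and mixes only those experts' outputs. The expert functions, gate weights, and top_k value below are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparse_moe(x, gate_W, experts, top_k=2):
    """Top-k learned gating: score experts from the features alone
    (no domain label), keep the best top_k, renormalize their weights,
    and mix only those experts' outputs."""
    scores = gate_W @ x                    # (n_experts,) gating logits
    chosen = np.argsort(scores)[-top_k:]   # indices of the top_k experts
    weights = softmax(scores[chosen])
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

# Illustrative experts: simple scalings standing in for expert networks.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0)]
gate_W = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0, 0.0]])  # invented gate weights
y = sparse_moe(np.ones(4), gate_W, experts, top_k=2)
```

The tension the premise must resolve is visible here: the gate can only specialize on whatever signal survives in `x`, so if the distiller truly strips all domain information, the routing must be keying on something else (e.g. fault type or residual condition variation).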
What would settle it
On a sixth unseen bearing dataset, YOTOnet trained on the original four datasets fails to exceed the F1 scores of the strongest competing methods that were also trained on those four datasets.
Original abstract
Mechanical equipment forms the critical backbone of modern industrial production, yet domain shift severely limits the generalization of deep learning based fault diagnosis models across different equipment and operating conditions. Inspired by the success of foundation models in achieving zero-shot generalization, we propose YOTOnet (You Only Train Once), a novel architecture specifically designed for cross-domain fault diagnosis in mechanical equipment. YOTOnet comprises three core components: (1) a physics-aware Invariant Feature Distiller that extracts domain-agnostic representations using multi-scale dilated convolutions and FFT-based time-frequency fusion, (2) Domain-Conditioned Sparse Experts (DC-MoE) that adaptively route inputs to specialized processors via learned gating without external metadata, and (3) a dual-head classification system with auxiliary supervision. Extensive validation on five public bearing datasets (CWRU, MFPT, XJTU, OTTAWA, HUST) through 30 cross-dataset protocols demonstrates the superiority of YOTOnet compared with other state-of-the-art methods. Critically, we observe a clear scaling effect: average test F1 improves from 0.5339 (1 training dataset) to 0.705 (4 datasets), with a clear gain when moving from 3 to 4 datasets. These findings provide empirical evidence that foundation model principles can enable robust, train-once deployment for industrial fault diagnosis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes YOTOnet (You Only Train Once), a novel architecture for zero-shot cross-domain fault diagnosis of mechanical equipment. It consists of three main components: (1) a physics-aware Invariant Feature Distiller that uses multi-scale dilated convolutions and FFT-based time-frequency fusion to extract domain-agnostic representations, (2) Domain-Conditioned Sparse Experts (DC-MoE) that perform adaptive routing to specialized experts via learned gating without external metadata, and (3) a dual-head classification system with auxiliary supervision. The work validates the approach on five public bearing datasets (CWRU, MFPT, XJTU, OTTAWA, HUST) across 30 cross-dataset protocols, reporting that average test F1 improves from 0.5339 (1 training dataset) to 0.705 (4 datasets) and outperforms other state-of-the-art methods, providing empirical support for foundation-model-style scaling in this domain.
Significance. If the empirical scaling trend and cross-domain gains hold under rigorous scrutiny, the work would be significant for industrial fault diagnosis by demonstrating that a single model trained once can generalize across equipment and operating conditions without domain-specific retraining or metadata. The explicit scaling observation (F1 rising with additional datasets) and the combination of invariant feature extraction with sparse expert routing offer a concrete architectural template that could be extended to other sensor-based diagnostic tasks.
major comments (3)
- [Methods] Description of the Invariant Feature Distiller and DC-MoE: The headline scaling result rests on the joint operation of the two mechanisms. If the Invariant Feature Distiller successfully produces domain-agnostic representations, the input to the learned gating network in DC-MoE necessarily contains little domain-discriminative signal, which directly undermines the premise that the gating can discover and exploit domain-specific expert specialization. The manuscript provides no analysis of expert activation statistics, gating entropy, or routing patterns conditioned on source versus target domains, leaving open whether the observed gains arise from true adaptive routing or simply from increased parameter capacity and more training data.
- [Experiments] Experimental results (30 cross-dataset protocols): The reported average F1 improvement from 0.5339 to 0.705 is presented without error bars, standard deviations across runs, or per-protocol breakdowns. This absence makes it impossible to determine whether the scaling trend is statistically reliable or sensitive to particular dataset pairings, which is load-bearing for the central claim of consistent superiority and foundation-model-like behavior.
- [Experiments] Baseline comparisons: The abstract states superiority over state-of-the-art methods but supplies no concrete baseline F1 scores, number of compared methods, or details on hyperparameter tuning protocols. Without these, the magnitude of improvement cannot be assessed relative to standard practices, weakening the comparative claim that underpins the practical significance.
minor comments (2)
- [Abstract and Experiments] The abstract and results sections would benefit from explicit statements of the number of experts, gating temperature, and auxiliary loss weights to allow reproducibility.
- [Methods] Notation for the FFT-based fusion and multi-scale dilated convolutions should be defined with equations rather than descriptive text only.
Simulated Author's Rebuttal
We thank the referee for the insightful comments on our work. We address each of the major comments point-by-point below and indicate the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Methods] Description of the Invariant Feature Distiller and DC-MoE: The headline scaling result rests on the joint operation of the two mechanisms. If the Invariant Feature Distiller successfully produces domain-agnostic representations, the input to the learned gating network in DC-MoE necessarily contains little domain-discriminative signal, which directly undermines the premise that the gating can discover and exploit domain-specific expert specialization. The manuscript provides no analysis of expert activation statistics, gating entropy, or routing patterns conditioned on source versus target domains, leaving open whether the observed gains arise from true adaptive routing or simply from increased parameter capacity and more training data.
Authors: The referee raises a valid concern regarding the interplay between the two components. The Invariant Feature Distiller reduces domain-specific information to promote generalization, yet the DC-MoE is intended to allow specialization on residual domain or task variations through learned routing. We will revise the manuscript to include a detailed analysis of the expert activation statistics, gating entropy, and routing patterns for both source and target domains. This addition will provide evidence on whether the performance gains stem from adaptive expert specialization or other factors such as increased model capacity. revision: yes
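The promised routing analysis could, for example, report mean gating entropy and per-expert load over a domain's samples. A minimal sketch of such diagnostics (the function names and thresholds are our illustration, not the authors' code):

```python
import numpy as np

def gating_entropy(routing_probs):
    """Mean entropy (nats) of per-sample routing distributions.
    routing_probs: (n_samples, n_experts), rows summing to 1.
    Entropy near log(n_experts) means near-uniform routing (little
    specialization); entropy near 0 means confident, specialized routing."""
    p = np.clip(routing_probs, 1e-12, 1.0)
    return float((-p * np.log(p)).sum(axis=1).mean())

def expert_load(routing_probs):
    """Fraction of total routing mass received by each expert;
    a highly skewed load indicates expert collapse."""
    return routing_probs.mean(axis=0)
```

Comparing these statistics between source and target domains would directly test whether the gains come from adaptive routing rather than raw capacity.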
-
Referee: [Experiments] Experimental results (30 cross-dataset protocols): The reported average F1 improvement from 0.5339 to 0.705 is presented without error bars, standard deviations across runs, or per-protocol breakdowns. This absence makes it impossible to determine whether the scaling trend is statistically reliable or sensitive to particular dataset pairings, which is load-bearing for the central claim of consistent superiority and foundation-model-like behavior.
Authors: We concur that error bars and per-protocol details are necessary to substantiate the scaling claims. In the revised manuscript, we will report the average F1 scores with standard deviations computed over multiple random seeds or runs. Additionally, we will provide a breakdown of results for each of the 30 cross-dataset protocols to allow readers to evaluate the consistency of the observed improvements. revision: yes
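The promised reporting reduces to a mean and sample standard deviation of F1 over seeds, per protocol. A sketch of that convention (the protocol names and scores below are placeholders, not the paper's results):

```python
import numpy as np

def summarize_f1(f1_per_protocol):
    """Mean and sample standard deviation (ddof=1) of F1 over repeated
    runs, per cross-dataset protocol.
    f1_per_protocol: dict mapping protocol name -> list of per-seed F1."""
    return {name: (float(np.mean(s)), float(np.std(s, ddof=1)))
            for name, s in f1_per_protocol.items()}

# Placeholder numbers, not reported results.
runs = {"CWRU->MFPT": [0.70, 0.72, 0.71],
        "XJTU->HUST": [0.64, 0.61, 0.67]}
summary = summarize_f1(runs)
```

Per-protocol rows of this form would let readers check whether the 0.5339-to-0.705 scaling trend is driven by a few favorable dataset pairings.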
-
Referee: [Experiments] Baseline comparisons: The abstract states superiority over state-of-the-art methods but supplies no concrete baseline F1 scores, number of compared methods, or details on hyperparameter tuning protocols. Without these, the magnitude of improvement cannot be assessed relative to standard practices, weakening the comparative claim that underpins the practical significance.
Authors: The manuscript body contains the full set of baseline comparisons, including F1 scores for the compared methods and the hyperparameter search protocol. To improve clarity in the abstract, we will incorporate the number of baseline methods and representative F1 improvements. This will better support the claim of superiority while keeping the abstract concise. revision: partial
Circularity Check
No circularity; results are empirical validation on external public datasets
full rationale
The paper introduces an architecture (Invariant Feature Distiller + DC-MoE + dual-head classifier) and reports performance gains measured on held-out cross-dataset protocols from five independent public bearing datasets (CWRU, MFPT, XJTU, OTTAWA, HUST). No equations or derivations are presented that reduce by construction to fitted parameters or self-referential definitions; the scaling claim (F1 from 0.5339 to 0.705) is obtained by training on increasing numbers of source domains and testing on target domains outside the training distribution. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is used to substitute for independent evidence.
Axiom & Free-Parameter Ledger
invented entities (2)
-
Domain-Conditioned Sparse Experts (DC-MoE)
no independent evidence
-
physics-aware Invariant Feature Distiller
no independent evidence
Reference graph
Works this paper leans on
-
[1]
D. Bourassa, F. Gauthier, G. Abdulnour, Equipment failures and their contribution to industrial incidents and accidents in the manufacturing industry, International Journal of Occupational Safety and Ergonomics (JOSE) 22 (2015) 1–23. doi:10.1080/10803548.2015.1116814
-
[2]
S. Qiu, X. Cui, Z. Ping, N. Shan, Z. Li, X. Bao, X. Xu, Deep learning techniques in intelligent fault diagnosis and prognosis for industrial systems: A review, Sensors 23 (3) (2023). doi:10.3390/s23031305
- [3]
-
[4]
F. Jiang, Y. Kuang, T. Li, S. Zhang, Z. Wu, K. Feng, W. Li, Towards enhanced interpretability: A mechanism-driven domain adaptation model for bearing fault diagnosis across operating conditions, Mechanical Systems and Signal Processing 225 (2025) 112244. doi:10.1016/j.ymssp.2024.112244
-
[5]
X. Pan, H. Chen, W. Wang, X. Su, Adversarial domain adaptation based on contrastive learning for bearings fault diagnosis, Simulation Modelling Practice and Theory 139 (2025) 103058. doi:10.1016/j.simpat.2024.103058
-
[6]
Z. Yang, L. Luo, J. Ma, H. Zhang, L. Yang, Z. Wu, Enhancing bearing fault diagnosis in real damages: A hybrid multidomain generalization network for feature comparison, IEEE Transactions on Instrumentation and Measurement 74 (2025) 1–11. doi:10.1109/TIM.2025.3556828
-
[7]
X. Cui, H. Zhan, K. Han, J. Yu, R. Wang, G. Huang, Multi-bearing fault diagnosis method based on convolutional autoencoder causal decoupling domain generalization, ISA Transactions 163 (2025) 236–250. doi:10.1016/j.isatra.2025.05.008
-
[8]
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, R. Girshick, Segment anything (2023). arXiv:2304.02643
-
[9]
A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, P. Schwaller, ChemCrow: Augmenting large-language models with chemistry tools (2023). arXiv:2304.05376
-
[10]
H. Dalla-Torre, L. Gonzalez, J. M. Revilla, N. L. Carranza, A. H. Grzywaczewski, F. Oteri, C. Dallago, E. Trop, H. Sirelkhatim, G. Richard, M. Skwark, K. Beguir, M. Lopez, T. Pierrot, The nucleotide transformer: Building and evaluating robust foundation models for human genomics, bioRxiv (2023). doi:10.1101/2023.01.11.523679
- [11]
- [12]
-
[13]
X. Chen, Y. Lei, Y. Li, S. Parkinson, X. Li, J. Liu, F. Lu, H. Wang, Z. Wang, B. Yang, S. Ye, Z. Zhao, Large models for machine monitoring and fault diagnostics: Opportunities, challenges, and future direction, Journal of Dynamics, Monitoring and Diagnostics 4 (2) (2025) 76–90. doi:10.37965/jdmd.2025.832
-
[14]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
-
[15]
F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv:1511.07122 (2016)
-
[16]
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
-
[17]
J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
-
[18]
Case Western Reserve University Bearing Data Center, CWRU bearing dataset, accessed 2025-11-01 (2019)
-
[19]
Xi'an Jiaotong University, Sumyoung Technology, XJTU-SY bearing accelerated life test dataset, accessed 2025-11-01 (2018)
-
[20]
A. Sehri, University of Ottawa, University of Ottawa bearing dataset, accessed 2025-11-01 (2023)
-
[21]
R. Liu, B. Yang, E. Zio, X. Chen, Artificial intelligence for fault diagnosis of rotating machinery: A review, Mechanical Systems and Signal Processing 108 (2018) 33–47. doi:10.1016/j.ymssp.2018.02.016
-
[22]
Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks (2016). arXiv:1505.07818. URL https://arxiv.org/abs/1505.07818
-
[23]
B. Sun, K. Saenko, Deep CORAL: Correlation alignment for deep domain adaptation (2016). arXiv:1607.01719. URL https://arxiv.org/abs/1607.01719
- [24]