arxiv: 2602.12708 · v2 · submitted 2026-02-13 · 💻 cs.LG

Recognition: no theorem link

Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

Jon Irureta , Gorka Azkune , Jon Imaz , Aizea Lojo , Javier Fernandez-Marques

Authors on Pith no claims yet

Pith reviewed 2026-05-16 01:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords vertical federated learningmixture of expertssplit learningsample misalignmentpretrained encodersfederated robustnessmodel interpretability

0 comments

The pith

Split-MoPE routes partially aligned samples through fixed experts and pretrained encoders to finish vertical federated training in one round.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vertical federated learning often discards samples or runs many communication rounds because participant datasets rarely cover the same records. Split-MoPE solves this by matching each sample to a predefined expert that processes exactly the available data views, using domain-specific pretrained encoders to produce features locally. The model then aggregates those features centrally after a single exchange. A sympathetic reader would care because the method claims both higher accuracy than prior misalignment fixes and built-in resistance to bad participants plus per-sample contribution scores.

Core claim

Split-MoPE integrates split learning with a Mixture of Predefined Experts architecture in which experts are statically assigned to specific patterns of data availability across parties. Pretrained encoders handle local feature extraction for each domain, allowing the system to use every sample that has at least one contributing party. This produces state-of-the-art accuracy on CIFAR-10/100 and Breast Cancer Wisconsin after one communication round, while adding robustness to noisy or malicious participants and per-sample interpretability through contribution quantification.

What carries the argument

Mixture of Predefined Experts (MoPE) that assigns fixed experts to concrete data-alignment patterns to maximize usable samples per batch.

If this is right

Training and inference require only one round of communication between parties.
Performance exceeds LASER and Vertical SplitNN on vision and tabular tasks with high rates of missing data.
Each prediction yields an explicit score for every participant's contribution.
The architecture tolerates malicious or noisy parties without extra defenses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fixed-expert pattern could be tried in cross-silo settings where domain-specific encoders are already deployed.
A hybrid version that occasionally learns new expert assignments might handle evolving data distributions.
The per-sample contribution scores may simplify audits required for regulated collaborative training.

Load-bearing premise

Effective pretrained encoders must already exist for every data domain held by the participants.

What would settle it

Replace the pretrained encoders with randomly initialized ones on the same CIFAR and tabular benchmarks and check whether the single-round accuracy advantage over LASER and Vertical SplitNN vanishes.

Figures

Figures reproduced from arXiv: 2602.12708 by Aizea Lojo, Gorka Azkune, Javier Fernandez-Marques, Jon Imaz, Jon Irureta.

**Figure 2.** Figure 2: (a) Classical VFL data partition, where the sample [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Real data partition of VFL where samples among [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 4.** Figure 4: Our proposed Split-MoPE, combining Split learning [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Representation of the forward pass in MoPE. The [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Evolution of MoPE’s performance as 𝑝𝑚𝑖𝑠𝑠 increases in CIFAR10 (a) and Breast Cancer Wisconsin (b). Split-MoPE is robust against noisy or malicious parties: To analyze the effect of noisy or malicious parties, we ran extensive [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 8.** Figure 8: Visual representation of the suggested process to [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 7.** Figure 7: Results with multiple noisy participants varying [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

read the original abstract

Vertical Federated Learning (VFL) has emerged as a critical paradigm for collaborative model training in privacy-sensitive domains such as finance and healthcare. However, most existing VFL frameworks rely on the idealized assumption of full sample alignment across participants, a premise that rarely holds in real-world scenarios. To bridge this gap, this work introduces Split-MoPE, a novel framework that integrates Split Learning with a specialized Mixture of Predefined Experts (MoPE) architecture. Unlike standard Mixture of Experts (MoE), where routing is learned dynamically, MoPE uses predefined experts to process specific data alignments, effectively maximizing data usage during both training and inference without requiring full sample overlap. By leveraging pretrained encoders for target data domains, Split-MoPE achieves state-of-the-art performance in a single communication round, significantly reducing the communication footprint compared to multi-round end-to-end training. Furthermore, unlike existing proposals that address sample misalignment, this novel architecture provides inherent robustness against malicious or noisy participants and offers per-sample interpretability by quantifying each collaborator's contribution to each prediction. Extensive evaluations on vision (CIFAR-10/100) and tabular (Breast Cancer Wisconsin) datasets demonstrate that Split-MoPE consistently outperforms state-of-the-art systems such as LASER and Vertical SplitNN, particularly in challenging scenarios with high data missingness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Split-MoPE combines split learning with fixed expert routing to handle partial sample overlaps in VFL and claims single-round gains, but those gains look tied to the quality of external pretrained encoders.

read the letter

The core idea here is using a mixture of predefined experts on top of split learning so that parties can still train when their samples only partially overlap. Instead of learning a router, they assign experts to specific alignment patterns upfront. This lets them pull in more data during training and inference without forcing full alignment, and they report it runs in one communication round while adding some robustness to bad participants and per-sample contribution scores. That combination is new relative to the LASER and Vertical SplitNN baselines they cite. The framing of the misalignment problem is clear and matches real constraints in finance and healthcare VFL, and the single-round protocol is a practical selling point for lowering communication cost. The interpretability angle on contribution per prediction is a nice side benefit if it holds up. The main soft spot is the heavy reliance on pretrained encoders for each domain. The abstract ties the SOTA numbers and robustness directly to those encoders, yet there is no sign of an ablation that swaps them for random initialization or joint training. Without that, it is hard to separate what the MoPE routing actually contributes from the strength of the external pretraining. The evaluation claims are also thin on details—no numbers, error bars, or exact misalignment rates appear in the abstract—so the full paper needs to show those controls clearly. This is worth a serious referee for groups working on deployable VFL. The architecture is straightforward and targets a genuine gap, even if the experiments will need tightening on the encoder dependence and quantitative reporting.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces Split-MoPE, a framework that combines Split Learning with a Mixture of Predefined Experts (MoPE) architecture for Vertical Federated Learning. It addresses sample misalignment by using predefined experts for specific data alignments, claims single-round state-of-the-art performance on CIFAR-10/100 and Breast Cancer Wisconsin datasets via pretrained encoders, and asserts reduced communication, inherent robustness to malicious/noisy participants, and per-sample interpretability through contribution quantification.

Significance. If the single-round performance, robustness, and interpretability claims hold after proper ablation, the work could meaningfully advance practical VFL in privacy-sensitive settings by reducing communication rounds and handling real-world misalignment. The use of predefined experts to maximize data usage without full overlap is a potentially useful idea, though its novelty relative to existing MoE variants in federated settings requires clearer positioning.

major comments (3)

[Abstract] Abstract: The claim of 'consistent outperformance' and 'state-of-the-art performance in a single communication round' is presented without any quantitative results, error bars, specific accuracy numbers, or ablation details, making it impossible to assess whether the gains are attributable to the MoPE architecture or to the pretrained encoders.
[Abstract] Abstract and evaluation description: The performance, robustness, and interpretability claims are explicitly conditioned on 'leveraging pretrained encoders for target data domains,' yet no ablation is described that replaces these encoders with randomly initialized or jointly trained alternatives to isolate the contribution of the MoPE routing and single-round protocol.
[Abstract] The manuscript provides no equations, derivations, or formal description of the MoPE routing mechanism, the predefined expert assignment, or how per-sample contribution quantification is computed, leaving the central algorithmic claims without mathematical grounding.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each of the major comments point by point below. Where revisions are needed, we have indicated that changes will be made in the next version of the paper.

read point-by-point responses

Referee: [Abstract] Abstract: The claim of 'consistent outperformance' and 'state-of-the-art performance in a single communication round' is presented without any quantitative results, error bars, specific accuracy numbers, or ablation details, making it impossible to assess whether the gains are attributable to the MoPE architecture or to the pretrained encoders.

Authors: The abstract serves as a concise summary of the key claims. Detailed quantitative results, including specific accuracy numbers with error bars from multiple runs, are provided in the experimental section of the manuscript (e.g., Tables 1-3). These results demonstrate the outperformance on the mentioned datasets. To improve the abstract's informativeness and allow readers to better assess the claims, we will revise it to include representative quantitative results. revision: yes
Referee: [Abstract] Abstract and evaluation description: The performance, robustness, and interpretability claims are explicitly conditioned on 'leveraging pretrained encoders for target data domains,' yet no ablation is described that replaces these encoders with randomly initialized or jointly trained alternatives to isolate the contribution of the MoPE routing and single-round protocol.

Authors: We recognize the value of isolating the MoPE contribution from the pretrained encoders. The current evaluation focuses on the practical setting with pretrained encoders, as this aligns with real-world VFL applications where domain-specific pretrained models are available. However, to address this concern, we will add new ablation experiments in the revised manuscript that include comparisons with randomly initialized encoders and end-to-end trained alternatives. This will help quantify the specific benefits of the MoPE routing and single-round protocol. revision: yes
Referee: [Abstract] The manuscript provides no equations, derivations, or formal description of the MoPE routing mechanism, the predefined expert assignment, or how per-sample contribution quantification is computed, leaving the central algorithmic claims without mathematical grounding.

Authors: We appreciate this feedback on the presentation. The manuscript does describe the MoPE architecture and routing in prose in Section 3, along with the contribution quantification method. To provide stronger mathematical grounding, we will include formal equations and a derivation for the routing mechanism, expert assignment, and per-sample contribution scores in the revised manuscript. This will make the algorithmic contributions more precise and easier to follow. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of useful pretrained encoders for each data domain and on the assumption that a fixed expert bank can be pre-assigned to alignment patterns without introducing new fitting parameters. No explicit free parameters or invented physical entities are stated.

axioms (1)

domain assumption Pretrained encoders for target data domains are available and effective for the downstream task.
Invoked to justify single-round training and performance claims.

invented entities (1)

Mixture of Predefined Experts (MoPE) routing no independent evidence
purpose: Handle specific data-alignment patterns without learned dynamic routing
New architectural component introduced to maximize data usage under misalignment.

pith-pipeline@v0.9.0 · 5545 in / 1321 out tokens · 67310 ms · 2026-05-16T01:48:16.204296+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 7 internal anchors

[1]

Kingma DP Ba J Adam et al. 2014. A method for stochastic optimization.arXiv preprint arXiv:1412.69801412, 6 (2014). Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

work page arXiv 2014
[2]

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang

work page
[3]

A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering(2025)

work page 2025
[4]

Iker Ceballos, Vivek Sharma, Eduardo Mugica, Abhishek Singh, Alberto Ro- man, Praneeth Vepakomma, and Ramesh Raskar. 2020. Splitnn-driven vertical partitioning.arXiv preprint arXiv:2008.04137(2020)

work page arXiv 2020
[5]

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. 2022. On the repre- sentation collapse of sparse mixture of experts.Advances in Neural Information Processing Systems35 (2022), 34600–34613

work page 2022
[6]

Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. 2023. Approximating Two-Layer Feedforward Networks for Efficient Transformers. InFindings of the Association for Computational Linguistics: EMNLP 2023. Association for Compu- tational Linguistics

work page 2023
[7]

Yue Cui, Chung-ju Huang, Yuzhu Zhang, Leye Wang, Lixin Fan, Xiaofang Zhou, and Qiang Yang. 2024. A survey on contribution evaluation in vertical federated learning.arXiv preprint arXiv:2405.02364(2024)

work page arXiv 2024
[8]

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. 2023. On the benefits of learning to route in mixture-of-experts models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 9376–9396

work page 2023
[10]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

work page 2022
[11]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition. 770–778

work page 2016
[12]

Yuanqin He, Yan Kang, Xinyuan Zhao, Jiahuan Luo, Lixin Fan, Yuxing Han, and Qiang Yang. 2024. A hybrid self-supervised learning framework for vertical federated learning.IEEE Transactions on Big Data(2024)

work page 2024
[13]

Chung-ju Huang, Leye Wang, and Xiao Han. 2023. Vertical federated knowledge transfer via representation distillation for healthcare collaboration networks. In Proceedings of the ACM Web Conference 2023. 4188–4199

work page 2023
[14]

Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng. 2024. Harder tasks need more experts: Dynamic routing in moe models.arXiv preprint arXiv:2403.07652(2024)

work page arXiv 2024
[15]

Jon Irureta, Jon Imaz, Aizea Lojo, Javier Fernandez-Marques, Marco González, and Iñigo Perona. 2024. Towards Active Participant Centric Vertical Feder- ated Learning: Some Representations May Be All You Need.arXiv preprint arXiv:2410.17648(2024)

work page arXiv 2024
[16]

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts.Neural computation3, 1 (1991), 79–87

work page 1991
[17]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al . 2024. Mixtral of experts.arXiv preprint arXiv:2401.04088(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Peng Jin, Bo Zhu, Li Yuan, and Shuicheng Yan. 2024. Moe++: Accelerating mixture-of-experts methods with zero-computation experts.arXiv preprint arXiv:2410.07348(2024)

work page arXiv 2024
[19]

Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm.Neural computation6, 2 (1994), 181–214

work page 1994
[20]

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cor- mode, Rachel Cummings, et al. 2021. Advances and open problems in federated learning.Foundations and trends®in machine learning14, 1–2 (2021), 1–210

work page 2021
[21]

Yan Kang, Yang Liu, and Xinle Liang. 2022. Fedcvt: Semi-supervised vertical federated learning with cross-view training.ACM Transactions on Intelligent Systems and Technology (TIST)13, 4 (2022), 1–16

work page 2022
[22]

Afsana Khan, Marijn Ten Thij, Guangzhi Tang, and Anna Wilbik. 2025. VFL-RPS: Relevant Participant Selection in Vertical Federated Learning. In2025 Interna- tional Joint Conference on Neural Networks (IJCNN). IEEE, 1–8

work page 2025
[23]

Afsana Khan, Marijn ten Thij, and Anna Wilbik. 2025. Vertical federated learning: A structured literature review.Knowledge and Information Systems(2025), 1–39

work page 2025
[24]

Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images.(2009)

work page 2009
[25]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[26]

Songze Li, Duanyi Yao, and Jin Liu. 2023. FedVS: Straggler-resilient and privacy- preserving vertical federated learning for split models. InInternational conference on machine learning. PMLR, 20296–20311

work page 2023
[27]

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems2 (2020), 429–450

work page 2020
[28]

Yang Liu, Yan Kang, Tianyuan Zou, Yanhong Pu, Yuanqin He, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Qiang Yang. 2024. Vertical federated learning: Concepts, advances, and challenges.IEEE transactions on knowledge and data engineering36, 7 (2024), 3615–3634

work page 2024
[29]

Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, and Yang You. 2021. Sparse- mlp: A fully-mlp architecture with conditional computation.arXiv preprint arXiv:2109.020081 (2021), 12

work page arXiv 2021
[30]

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep net- works from decentralized data. InArtificial intelligence and statistics. PMLR, 1273–1282

work page 2017
[31]

Daniel Morales, Isaac Agudo, and Javier Lopez. 2023. Private set intersection: A systematic literature review.Computer Science Review49 (2023), 100567

work page 2023
[32]

Xiaonan Nie, Shijie Cao, Xupeng Miao, Lingxiao Ma, Jilong Xue, Youshan Miao, Zichao Yang, Zhi Yang, and Bin Cui. 2021. Dense-to-sparse gate for mixture-of- experts. (2021)

work page 2021
[33]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El- Nouby, et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečn`y, Sanjiv Kumar, and H Brendan McMahan. 2020. Adaptive feder- ated optimization.arXiv preprint arXiv:2003.00295(2020)

work page arXiv 2020
[35]

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. 2021. Scaling vision with sparse mixture of experts.Advances in Neural Information Processing Systems34 (2021), 8583–8595

work page 2021
[36]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538 [cs.LG] https: //arxiv.org/abs/1701.06538

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Jingwei Sun, Ziyue Xu, Dong Yang, Vishwesh Nath, Wenqi Li, Can Zhao, Daguang Xu, Yiran Chen, and Holger R Roth. 2023. Communication-efficient vertical fed- erated learning with limited overlapping samples. InProceedings of the IEEE/CVF International Conference on Computer Vision. 5203–5212

work page 2023
[38]

Pedro Valdeira, Shiqiang Wang, and Yuejie Chi. 2024. Vertical federated learning with missing features during training and inference.arXiv preprint arXiv:2410.22564(2024)

work page arXiv 2024
[39]

Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. 2018. Split learning for health: Distributed deep learning without sharing raw patient data.arXiv preprint arXiv:1812.00564(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[40]

Jie Wen, Zhixia Zhang, Yang Lan, Zhihua Cui, Jianghui Cai, and Wensheng Zhang

work page
[41]

A survey on federated learning: challenges and applications.International journal of machine learning and cybernetics14, 2 (2023), 513–535

work page 2023
[42]

William Wolberg, Olvi Mangasarian, Nick Street, and W. Street. 1993. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5DW2B

work page doi:10.24432/c5dw2b 1993
[43]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al . 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.arXiv preprint arXiv:2506.05176(2025). A Models, Hyperparameters and Datasets Architectural details:In Table 6, we summarize the main ch...

work page internal anchor Pith review Pith/arXiv arXiv 2025