pith. machine review for the scientific record.

arxiv: 2605.14289 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL · cs.CR

Recognition: no theorem link

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CR
keywords data · metamoe · expert · privacy-preserving · proxy · selection · training · unification

The pith

MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core problem is that Mixture-of-Experts models need a router to pick the right specialist for each input, but training that router normally requires seeing all the data together. When data lives on separate clients and cannot be shared for privacy reasons, each client can still train its own expert locally. MetaMoE then picks a small set of public data samples that are both relevant to each client's domain and diverse enough to cover it. These public samples stand in for the missing private data to train the router and to make the separate experts coordinate better when they are later combined. A context-aware router further helps pick experts for new inputs. The authors test the method on standard image classification and language modeling benchmarks and report that it beats other recent approaches that also try to build unified MoE models without moving private data. The code is released publicly.
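To make the data flow concrete, here is a minimal, self-contained sketch of the idea in the paragraph above: frozen stand-in experts, synthetic proxy features grouped by client domain, and a plain linear router trained only on those proxies. Every name and setting below is illustrative rather than the authors' implementation; the actual context-aware router, proxy selection, and expert alignment live in the released repository (https://github.com/ws-jiang/MetaMoE).

```python
# Toy sketch: frozen, independently trained experts combined under a router
# that is supervised only by public proxy features standing in for private data.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_experts, n_proxy = 32, 3, 200

# Stand-ins for domain-specialized experts trained locally on private data.
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
for e in experts:
    e.requires_grad_(False)                      # experts stay frozen at unification time

# Proxy features per client: here synthetic clusters; in MetaMoE these would be
# public samples chosen by relevance-plus-diversity selection for each domain.
proxy_x = torch.cat([torch.randn(n_proxy, d) + 3.0 * k for k in range(n_experts)])
proxy_domain = torch.arange(n_experts).repeat_interleave(n_proxy)

router = nn.Linear(d, n_experts)                 # plain router; the paper's is context-aware
opt = torch.optim.Adam(router.parameters(), lr=1e-2)
for _ in range(200):
    # Supervise the router to send each proxy sample to its own domain's expert.
    loss = nn.functional.cross_entropy(router(proxy_x), proxy_domain)
    opt.zero_grad(); loss.backward(); opt.step()

def moe_forward(x):
    # Mixture-of-Experts inference: weight each frozen expert by the router's gate.
    gates = torch.softmax(router(x), dim=-1)                 # (batch, n_experts)
    outs = torch.stack([e(x) for e in experts], dim=1)       # (batch, n_experts, d)
    return (gates.unsqueeze(-1) * outs).sum(dim=1)

print(moe_forward(torch.randn(4, d)).shape)      # torch.Size([4, 32])
```

With enough separation between the synthetic domains, the router learns to route each proxy cluster to its own expert; the paper's contribution is choosing real public proxies well enough that the same supervision transfers to unseen private-domain inputs.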

Core claim

Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods.

Load-bearing premise

Public proxy data selected for domain relevance and diversity can sufficiently approximate inaccessible private data distributions to supervise router learning and expert alignment without introducing large distribution shift.

Figures

Figures reproduced from arXiv: 2605.14289 by Shuhao Chen, Sinno Jialin Pan, Weisen Jiang.

Figure 1
Figure 1. view at source ↗
Figure 2
Figure 2. t-SNE visualization of selected proxy samples with random selection, FlexOlmo selection (relevance only), and our selection (relevance + diversity) for Pets with ViT-B/32 as the seed model. As can be seen, our selection yields a more diverse and representative proxy dataset that covers the private-data manifold more effectively (see Section 4.3 for further analysis). view at source ↗
Figure 3
Figure 3. Comparison of MetaMoE with training solely on proxy data in the CV setting with CLIP ViT-B/16. view at source ↗
Figure 4
Figure 4. Randomly sampled images from the three client datasets used in the CV experiments: Pets (Parkhi et al., 2012), Flowers (Nilsback & Zisserman, 2008), and EuroSAT (Helber et al., 2019). These examples illustrate the visual diversity across domains, ranging from fine-grained object recognition of dog and cat breeds (Pets), to natural scene categorization of flower species (Flowers), and remote sensin… view at source ↗
Figure 5
Figure 5. Sample images from ImageNet. view at source ↗
Read the original abstract

Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the method builds on standard MoE routing and proxy-data ideas without detailing new postulates.

pith-pipeline@v0.9.0 · 5480 in / 1078 out tokens · 115681 ms · 2026-05-15T02:21:07.014636+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    Mixture-of-LoRAs: An efficient multitask tuning for large language models

    Feng, W., Hao, C., Zhang, Y., Han, Y., and Wang, H. Mixture-of-LoRAs: An efficient multitask tuning for large language models. Preprint arXiv:2403.03432.

  2. [2]

    Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306, 2022

    Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. Branch-Train-Merge: Embarrassingly parallel training of expert language models. Preprint arXiv:2208.03306.

  3. [3]

    Designing data and methods for effective instruction tuning

    Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. Designing data and methods for effective instruction tuning. Preprint arXiv:2301.13688.

  4. [4]

    Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities

    Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. Preprint arXiv:2408.07666.

  5. [5]

    A. Computation of Relevance Score. Following FlexOlmo (Shi et al., 2025), we compute the relevance score g(x, D_p) of a public sample x ∈ D_0 with respect to a client dataset D_p by training a binary classifier to distinguish D_p from D_0. Specifically, we construct a training set by labeling samples from D_p as positive... (a toy sketch of this relevance score appears after the reference list)

  6. [6]

    Over all n candidates and m greedy steps, the total cost for each greedy step scales as O(nm³), which is infeasible when both n and m are large

    matrix, incurring O(m³) time per evaluation. Over all n candidates and m greedy steps, the total cost for each greedy step scales as O(nm³), which is infeasible when both n and m are large. Cholesky Factorization. To avoid redundant recomputation, we exploit the Cholesky decomposition (Horn & Johnson, 1985). Suppose the current kernel submatrix admits a decomp...

  7. [7]

    Both steps are efficient: solving a triangular system costs O(m), and computing a norm is linear in m as well

    Thus, the update reduces to solving a triangular system (for y) and computing a residual variance (for σ²). Both steps are efficient: solving a triangular system costs O(m), and computing a norm is linear in m as well. Hence, each iteration of greedy MAP inference only requires a total time of O(nm) for searching ... (a toy sketch of this greedy selection appears after the reference list)

  8. [8]

    Figure 5 shows randomly sampled examples from ImageNet

    as the public dataset from which proxy samples are drawn. Figure 5 shows randomly sampled examples from ImageNet. (a) Pets. (b) Flowers. (c) EuroSAT. Figure 4. Sample images from the three client domains: Pets, Flowers, and EuroSAT. Figure 5. Sample images from ImageNet. D. Natural Language Processing Datasets. The client-side NLP datasets comprise Commonsen...

  9. [9]

    This leads to substantial bandwidth and memory overhead and creates instability when client data are heterogeneous because divergent local updates must be averaged

    Federated learning (FL) methods differ fundamentally from MetaMoE: they require exchanging large model states every round. This leads to substantial bandwidth and memory overhead and creates instability when client data are heterogeneous because divergent local updates must be averaged. MetaMoE eliminates synchronization entirely—each client fine-tunes it...

  10. [10]

    This approach incurs high communication and memory costs and often becomes unstable under heterogeneous client data, leading to degraded performance

    adopts a federated learning paradigm that requires repeated synchronization among clients—periodically exchanging model parameters for joint optimization. This approach incurs high communication and memory costs and often becomes unstable under heterogeneous client data, leading to degraded performance. In contrast, MetaMoE avoids exchanging model paramete...

  11. [11]

    Table 13. Cost–performance comparison on CV tasks (ACC; Unify Time in s; Inference Speed in samples/s).
    Method     | ViT-B/32 ACC | Unify Time | Speed | ViT-B/16 ACC | Unify Time | Speed
    BTM        | 90.33        | −          | 606   | 91.75        | −          | 249
    ModelSoup  | 74.20        | 5.72       | 1813  | 79.42        | 5.72       | 743
    BTX        | 74.30        | 11.13      | 1758  | 81.20        | 19.72      | 715
    FlexOlmo   | 92.92        | 11.93      | 1767  | 93.53        | 18.24      | 719
    MetaMoE    | 94.52        | 12.15      | 1751  | 96.24        | 19.88      | 710
    Table ...

  12. [12]

    FlexOlmo initializes its domain-informed router using per-expert routing embeddings computed as the mean embedding over private data alone (Section 3.3.2 of Shi et al.

    through the routing mechanism. FlexOlmo initializes its domain-informed router using per-expert routing embeddings computed as the mean embedding over private data alone (Section 3.3.2 of Shi et al. (2025)). Concretely, FlexOlmo shares the vector μ_priv = (1/N) Σ_{i=1}^{N} f(x_i), which is the complete mean private embedding. In contrast, MetaMoE shares e = (N/(N+m)) μ_p...
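
Two toy sketches of the machinery quoted in entries [5] through [7] above. First, the relevance score from entry [5]: a binary classifier is trained to separate a client's dataset D_p from the public pool D_0, and its positive-class probability on a public sample serves as g(x, D_p). This is a minimal, self-contained illustration with random stand-in embeddings, not the paper's released code; the logistic-regression choice and all sizes below are assumptions.

```python
# Toy illustration of the relevance score g(x, D_p) from entry [5]:
# train a classifier to tell client data (D_p) apart from public data (D_0),
# then score each public sample by its predicted probability of being client-like.
# Features, sizes, and the logistic-regression choice are stand-ins, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
client_feats = rng.normal(loc=1.0, size=(500, 64))    # embeddings of D_p (label 1)
public_feats = rng.normal(loc=0.0, size=(5000, 64))   # embeddings of D_0 (label 0)

X = np.vstack([client_feats, public_feats])
y = np.concatenate([np.ones(len(client_feats)), np.zeros(len(public_feats))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Relevance of each public sample to this client's domain.
relevance = clf.predict_proba(public_feats)[:, 1]
top_idx = np.argsort(-relevance)[:256]                 # most client-relevant public samples
print(relevance[top_idx[:5]])
```

In the pipeline summarized in the pith, this score is then combined with a diversity criterion rather than used as a plain top-k cut.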
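Second, the fast greedy selection behind entries [6] and [7]: each candidate keeps a residual variance d_i², and after an item is selected every candidate is updated with one incremental Cholesky-style step, so an iteration costs O(nm) instead of the naive O(nm³). The routine below is a generic greedy MAP sketch for a positive semidefinite kernel; the relevance-weighted kernel in the usage lines is one common construction and an assumption, not necessarily the paper's exact kernel.

```python
# Greedy MAP selection over a PSD kernel K with incremental Cholesky-style updates,
# in the spirit of entries [6] and [7]: one O(n*m) update per step instead of
# re-evaluating an O(m^3) determinant for every candidate.
import numpy as np

def greedy_map(K, k):
    n = K.shape[0]
    cis = np.zeros((k, n))                    # growing Cholesky rows, one column per candidate
    di2 = np.array(np.diag(K), dtype=float)   # residual variances d_i^2
    selected = []
    j = int(np.argmax(di2))
    for step in range(k):
        if di2[j] <= 1e-12:                   # kernel numerically exhausted; stop early
            break
        selected.append(j)
        dj = np.sqrt(di2[j])
        # One triangular-solve-like update per candidate: e_i = (K[j, i] - <c_j, c_i>) / d_j
        eis = (K[j, :] - cis[:step, j] @ cis[:step, :]) / dj
        cis[step, :] = eis
        di2 = di2 - eis ** 2                  # shrink residual variances
        di2[j] = -np.inf                      # never pick the same item twice
        j = int(np.argmax(di2))
    return selected

# Usage with a relevance-weighted similarity kernel (an assumed construction):
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
S = feats @ feats.T                           # cosine similarities (PSD Gram matrix)
q = rng.uniform(0.1, 1.0, size=1000)          # e.g. classifier-based relevance scores
L = q[:, None] * S * q[None, :]               # quality-weighted kernel, still PSD
print(greedy_map(L, 50)[:10])                 # indices of 50 relevant-and-diverse proxies
```

Each step touches only the row K[j, :] plus the m rows stored so far, which is where the O(nm) per-iteration cost quoted in entry [7] comes from.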