pith. machine review for the scientific record.

arxiv: 2605.14289 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL · cs.CR

Recognition: no theorem link

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CR
keywords data · metamoe · expert · privacy-preserving · proxy · selection · training · unification

The pith

MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core problem is that Mixture-of-Experts models need a router to pick the right specialist for each input, but training that router normally requires seeing all the data together. When data lives on separate clients and cannot be shared for privacy reasons, each client can still train its own expert locally. MetaMoE then picks a small set of public data samples that are both relevant to each client's domain and diverse enough to cover it. These public samples stand in for the missing private data to train the router and to make the separate experts coordinate better when they are later combined. A context-aware router further helps pick experts for new inputs. The authors test the method on standard image classification and language modeling benchmarks and report that it beats other recent approaches that also try to build unified MoE models without moving private data. The code is released publicly.
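To make the data flow concrete, here is a minimal, self-contained sketch of the idea in the paragraph above: frozen stand-in experts, synthetic proxy features grouped by client domain, and a plain linear router trained only on those proxies. Every name and setting below is illustrative rather than the authors' implementation; the actual context-aware router, proxy selection, and expert alignment live in the released repository (https://github.com/ws-jiang/MetaMoE).

```python
# Toy sketch: frozen, independently trained experts combined under a router
# that is supervised only by public proxy features standing in for private data.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_experts, n_proxy = 32, 3, 200

# Stand-ins for domain-specialized experts trained locally on private data.
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
for e in experts:
    e.requires_grad_(False)                      # experts stay frozen at unification time

# Proxy features per client: here synthetic clusters; in MetaMoE these would be
# public samples chosen by relevance-plus-diversity selection for each domain.
proxy_x = torch.cat([torch.randn(n_proxy, d) + 3.0 * k for k in range(n_experts)])
proxy_domain = torch.arange(n_experts).repeat_interleave(n_proxy)

router = nn.Linear(d, n_experts)                 # plain router; the paper's is context-aware
opt = torch.optim.Adam(router.parameters(), lr=1e-2)
for _ in range(200):
    # Supervise the router to send each proxy sample to its own domain's expert.
    loss = nn.functional.cross_entropy(router(proxy_x), proxy_domain)
    opt.zero_grad(); loss.backward(); opt.step()

def moe_forward(x):
    # Mixture-of-Experts inference: weight each frozen expert by the router's gate.
    gates = torch.softmax(router(x), dim=-1)                 # (batch, n_experts)
    outs = torch.stack([e(x) for e in experts], dim=1)       # (batch, n_experts, d)
    return (gates.unsqueeze(-1) * outs).sum(dim=1)

print(moe_forward(torch.randn(4, d)).shape)      # torch.Size([4, 32])
```

With enough separation between the synthetic domains, the router learns to route each proxy cluster to its own expert; the paper's contribution is choosing real public proxies well enough that the same supervision transfers to unseen private-domain inputs.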

Core claim

Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods.

Load-bearing premise

Public proxy data selected for domain relevance and diversity can sufficiently approximate inaccessible private data distributions to supervise router learning and expert alignment without introducing large distribution shift.

Figures

Figures reproduced from arXiv: 2605.14289 by Shuhao Chen, Sinno Jialin Pan, Weisen Jiang.

Figure 1
Figure 1. view at source ↗
Figure 2
Figure 2. t-SNE visualization of selected proxy samples with random selection, FlexOlmo selection (relevance only), and our selection (relevance + diversity) for Pets with ViT-B/32 as the seed model. As can be seen, our selection yields a more diverse and representative proxy dataset that covers the private-data manifold more effectively (see Section 4.3 for further analysis). view at source ↗
Figure 3
Figure 3. Comparison of MetaMoE with training solely on proxy data in the CV setting with CLIP ViT-B/16. view at source ↗
Figure 4
Figure 4. Randomly sampled images from the three client datasets used in the CV experiments: Pets (Parkhi et al., 2012), Flowers (Nilsback & Zisserman, 2008), and EuroSAT (Helber et al., 2019). These examples illustrate the visual diversity across domains, ranging from fine-grained object recognition of dog and cat breeds (Pets), to natural scene categorization of flower species (Flowers), and remote sensin… view at source ↗
Figure 5
Figure 5. Sample images from ImageNet. view at source ↗
Read the original abstract

Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the method builds on standard MoE routing and proxy-data ideas without detailing new postulates.

pith-pipeline@v0.9.0 · 5480 in / 1078 out tokens · 115681 ms · 2026-05-15T02:21:07.014636+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    Mixture-of-LoRAs: An efficient multitask tuning for large language models

    Feng, W., Hao, C., Zhang, Y., Han, Y., and Wang, H. Mixture-of-LoRAs: An efficient multitask tuning for large language models. Preprint arXiv:2403.03432.

  2. [2]

    Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306, 2022

    Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. Branch-Train-Merge: Embarrassingly parallel training of expert language models. Preprint arXiv:2208.03306.

  3. [3]

    Designing data and methods for effective instruction tuning

    Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. Designing data and methods for effective instruction tuning. Preprint arXiv:2301.13688.

  4. [4]

    Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities

    Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. Preprint arXiv:2408.07666.

  5. [5]

    A. Computation of Relevance Score. Following FlexOlmo (Shi et al., 2025), we compute the relevance score g(x, D_p) of a public sample x ∈ D_0 with respect to a client dataset D_p by training a binary classifier to distinguish D_p from D_0. Specifically, we construct a training set by labeling samples from D_p as positive... (a toy sketch of this relevance score appears after the reference list)

  6. [6]

    Over all n candidates and m greedy steps, the total cost for each greedy step scales as O(nm³), which is infeasible when both n and m are large

    matrix, incurring O(m³) time per evaluation. Over all n candidates and m greedy steps, the total cost for each greedy step scales as O(nm³), which is infeasible when both n and m are large. Cholesky Factorization. To avoid redundant recomputation, we exploit the Cholesky decomposition (Horn & Johnson, 1985). Suppose the current kernel submatrix admits a decomp...

  7. [7]

    Both steps are efficient: solving a triangular system costs O(m), and computing a norm is linear in m as well

    Thus, the update reduces to solving a triangular system (for y) and computing a residual variance (for σ²). Both steps are efficient: solving a triangular system costs O(m), and computing a norm is linear in m as well. Hence, each iteration of greedy MAP inference only requires a total time of O(nm) for searching ... (a toy sketch of this greedy selection appears after the reference list)

  8. [8]

    Figure 5 shows randomly sampled examples from ImageNet

    as the public dataset from which proxy samples are drawn. Figure 5 shows randomly sampled examples from ImageNet. (a) Pets. (b) Flowers. (c) EuroSAT. Figure 4. Sample images from the three client domains: Pets, Flowers, and EuroSAT. Figure 5. Sample images from ImageNet. D. Natural Language Processing Datasets. The client-side NLP datasets comprise Commonsen...

  9. [9]

    This leads to substantial bandwidth and memory overhead and creates instability when client data are heterogeneous because divergent local updates must be averaged

    Federated learning (FL) methods differ fundamentally from MetaMoE: they require exchanging large model states every round. This leads to substantial bandwidth and memory overhead and creates instability when client data are heterogeneous because divergent local updates must be averaged. MetaMoE eliminates synchronization entirely—each client fine-tunes it...

  10. [10]

    This approach incurs high communication and memory costs and often becomes unstable under heterogeneous client data, leading to degraded performance

    adopts a federated learning paradigm that requires repeated synchronization among clients—periodically exchanging model parameters for joint optimization. This approach incurs high communication and memory costs and often becomes unstable under heterogeneous client data, leading to degraded performance. In contrast, MetaMoE avoids exchanging model paramete...

  11. [11]

    Table 13. Cost–performance comparison on CV tasks (ACC; Unify Time in s; Inference Speed in samples/s).
    Method     | ViT-B/32 ACC | Unify Time | Speed | ViT-B/16 ACC | Unify Time | Speed
    BTM        | 90.33        | −          | 606   | 91.75        | −          | 249
    ModelSoup  | 74.20        | 5.72       | 1813  | 79.42        | 5.72       | 743
    BTX        | 74.30        | 11.13      | 1758  | 81.20        | 19.72      | 715
    FlexOlmo   | 92.92        | 11.93      | 1767  | 93.53        | 18.24      | 719
    MetaMoE    | 94.52        | 12.15      | 1751  | 96.24        | 19.88      | 710
    Table ...

  12. [12]

    FlexOlmo initializes its domain-informed router using per-expert routing embeddings computed as the mean embedding over private data alone (Section 3.3.2 of Shi et al.

    through the routing mechanism. FlexOlmo initializes its domain-informed router using per-expert routing embeddings computed as the mean embedding over private data alone (Section 3.3.2 of Shi et al. (2025)). Concretely, FlexOlmo shares the vector μ_priv = (1/N) Σ_{i=1}^{N} f(x_i), which is the complete mean private embedding. In contrast, MetaMoE shares e = (N/(N+m)) μ_p...
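
Two toy sketches of the machinery quoted in entries [5] through [7] above. First, the relevance score from entry [5]: a binary classifier is trained to separate a client's dataset D_p from the public pool D_0, and its positive-class probability on a public sample serves as g(x, D_p). This is a minimal, self-contained illustration with random stand-in embeddings, not the paper's released code; the logistic-regression choice and all sizes below are assumptions.

```python
# Toy illustration of the relevance score g(x, D_p) from entry [5]:
# train a classifier to tell client data (D_p) apart from public data (D_0),
# then score each public sample by its predicted probability of being client-like.
# Features, sizes, and the logistic-regression choice are stand-ins, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
client_feats = rng.normal(loc=1.0, size=(500, 64))    # embeddings of D_p (label 1)
public_feats = rng.normal(loc=0.0, size=(5000, 64))   # embeddings of D_0 (label 0)

X = np.vstack([client_feats, public_feats])
y = np.concatenate([np.ones(len(client_feats)), np.zeros(len(public_feats))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Relevance of each public sample to this client's domain.
relevance = clf.predict_proba(public_feats)[:, 1]
top_idx = np.argsort(-relevance)[:256]                 # most client-relevant public samples
print(relevance[top_idx[:5]])
```

In the pipeline summarized in the pith, this score is then combined with a diversity criterion rather than used as a plain top-k cut.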
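Second, the fast greedy selection behind entries [6] and [7]: each candidate keeps a residual variance d_i², and after an item is selected every candidate is updated with one incremental Cholesky-style step, so an iteration costs O(nm) instead of the naive O(nm³). The routine below is a generic greedy MAP sketch for a positive semidefinite kernel; the relevance-weighted kernel in the usage lines is one common construction and an assumption, not necessarily the paper's exact kernel.

```python
# Greedy MAP selection over a PSD kernel K with incremental Cholesky-style updates,
# in the spirit of entries [6] and [7]: one O(n*m) update per step instead of
# re-evaluating an O(m^3) determinant for every candidate.
import numpy as np

def greedy_map(K, k):
    n = K.shape[0]
    cis = np.zeros((k, n))                    # growing Cholesky rows, one column per candidate
    di2 = np.array(np.diag(K), dtype=float)   # residual variances d_i^2
    selected = []
    j = int(np.argmax(di2))
    for step in range(k):
        if di2[j] <= 1e-12:                   # kernel numerically exhausted; stop early
            break
        selected.append(j)
        dj = np.sqrt(di2[j])
        # One triangular-solve-like update per candidate: e_i = (K[j, i] - <c_j, c_i>) / d_j
        eis = (K[j, :] - cis[:step, j] @ cis[:step, :]) / dj
        cis[step, :] = eis
        di2 = di2 - eis ** 2                  # shrink residual variances
        di2[j] = -np.inf                      # never pick the same item twice
        j = int(np.argmax(di2))
    return selected

# Usage with a relevance-weighted similarity kernel (an assumed construction):
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
S = feats @ feats.T                           # cosine similarities (PSD Gram matrix)
q = rng.uniform(0.1, 1.0, size=1000)          # e.g. classifier-based relevance scores
L = q[:, None] * S * q[None, :]               # quality-weighted kernel, still PSD
print(greedy_map(L, 50)[:10])                 # indices of 50 relevant-and-diverse proxies
```

Each step touches only the row K[j, :] plus the m rows stored so far, which is where the O(nm) per-iteration cost quoted in entry [7] comes from.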