MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Karthik Nandakumar; Nurbek Tastan; Samuel Horvath; Stefanos Laskaridis

arxiv: 2602.06154 · v2 · pith:3WIAVEYXnew · submitted 2026-02-05 · 💻 cs.LG · cs.CL

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

keywords experts, mose, expert, models, slimmable, inference, model, training

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing regimes, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of a DeepSeek model show that MoSE matches or improves on standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code can be found at: https://github.com/tnurbek/mose.
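
To make the idea of a nested, slimmable expert concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes each expert is a two-layer ReLU feed-forward block and that running at width `w` means using only the first `w` fraction of hidden units, so narrower sub-experts are prefixes of wider ones. The names `SlimmableExpert`, `mose_layer`, `router_w`, and `width` are hypothetical, and the routing shown is plain softmax top-k rather than any specific regime from the paper.

```python
import numpy as np


class SlimmableExpert:
    """Two-layer feed-forward expert whose hidden units are nested by width.

    Illustrative assumption: width w uses the first round(w * d_hidden) units.
    """

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

    def _active_units(self, width):
        # Number of hidden units used at this width (at least one).
        return max(1, int(round(width * self.w_in.shape[1])))

    def forward(self, x, width=1.0):
        # Slice a prefix of the hidden dimension, so widths are nested.
        k = self._active_units(width)
        hidden = np.maximum(x @ self.w_in[:, :k], 0.0)  # ReLU
        return hidden @ self.w_out[:k, :]

    def flops(self, width=1.0):
        # Approximate per-token cost: two matmuls, 2 FLOPs per multiply-add.
        k = self._active_units(width)
        return 4 * self.w_in.shape[0] * k


def mose_layer(x, experts, router_w, top_k, width):
    """Route one token to its top_k experts and run each at the given width."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    chosen = np.argsort(probs)[::-1][:top_k]
    gates = probs[chosen] / probs[chosen].sum()  # renormalise over chosen experts
    out = np.zeros_like(x)
    for gate, idx in zip(gates, chosen):
        out += gate * experts[idx].forward(x, width)
    return out, chosen


rng = np.random.default_rng(0)
experts = [SlimmableExpert(d_model=8, d_hidden=16, rng=rng) for _ in range(4)]
router_w = rng.standard_normal((8, 4))
token = rng.standard_normal(8)

full_out, chosen = mose_layer(token, experts, router_w, top_k=2, width=1.0)
half_out, _ = mose_layer(token, experts, router_w, top_k=2, width=0.5)
```

In this sketch the expert cost scales linearly with `width`, which is one way to read the abstract's claim of a more continuous accuracy-compute trade-off: the same weights serve every width, and only the slice changes at inference time. A single `width` is shared by all selected experts here for simplicity; a per-expert width chosen from router probabilities would replace that scalar.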

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
stat.ML 2026-05 unverdicted novelty 7.0

InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRP...

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
cs.CL 2026-05 unverdicted novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.