citation dossier
Michael I. Jordan and Robert A. Jacobs
why this work matters in Pith
Pith has found this work cited in 4 reviewed papers. Its strongest current cluster is cs.LG (2 papers). The largest review-status bucket among citing papers is UNVERDICTED (4 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
verdicts
UNVERDICTED: 4

representative citing papers
Multimodal contrastive learning using multilinear products is fragile to a single bad modality, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
A new MoE training method integrates expert-level losses and partial online updates to improve forecasting accuracy and efficiency over standard statistical and neural models.
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
citing papers explorer
- Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts
  Boundary mass in MoE is linear in slab width under smoothness and transversality, so the zero-temperature limit is governed by a thin geometric layer around routing interfaces rather than the full input space.
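A minimal formalization of the linearity claim, under assumed notation (router logits g_i, g_j, slab S_ε, input measure μ; none of this is taken verbatim from the paper):

```latex
% Sketch, not the paper's exact statement. Assume C^1 router logits
% g_i, g_j : R^d -> R (smoothness) and transversality:
% \nabla(g_i - g_j) \neq 0 on the interface \{g_i = g_j\}, so the
% interface is a (d-1)-dimensional hypersurface.
\[
  S_\varepsilon \;=\; \bigl\{\, x \in \mathbb{R}^d :
      \lvert g_i(x) - g_j(x) \rvert \le \varepsilon \,\bigr\}
\]
% By the coarea formula, for an input density p,
\[
  \mu(S_\varepsilon)
  \;=\; \int_{-\varepsilon}^{\varepsilon}
        \int_{\{g_i - g_j = t\}}
        \frac{p(x)}{\lVert \nabla (g_i - g_j)(x) \rVert}
        \, d\mathcal{H}^{d-1}(x)\, dt
  \;=\; \Theta(\varepsilon).
\]
% Boundary mass is linear in slab width, so as the routing temperature
% goes to zero, soft and hard routing differ only on this thin layer.
```

Smoothness makes the coarea formula applicable, and transversality keeps the inner surface integral finite and nonzero, which is where the Θ(ε) rate comes from.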
- Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
  Multimodal contrastive learning using multilinear products is fragile to a single bad modality, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
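A minimal numpy sketch of the fragility argument; trilinear_score, gated_score, and the norm-based gate are illustrative assumptions, not the paper's construction:

```python
# Sketch, not the paper's code: why a trilinear (multiplicative) score
# collapses under one bad modality, and a norm-based gate as a fallback.
import numpy as np

rng = np.random.default_rng(0)
d = 64
a, b = rng.normal(size=d), rng.normal(size=d)
c_good = a + b + 0.1 * rng.normal(size=d)  # informative third modality
c_bad = np.zeros(d)                        # corrupted/missing modality

def trilinear_score(a, b, c):
    # Multilinear product: sum_i a_i * b_i * c_i. A near-zero modality
    # drives the whole score to zero, regardless of the other two.
    return float(np.sum(a * b * c))

def gated_score(a, b, c, tau=1e-3):
    # Hypothetical gate: when modality c looks degenerate (tiny norm),
    # fall back to the bimodal dot product a . b.
    gate = 1.0 - np.exp(-np.linalg.norm(c) / (tau * np.sqrt(len(c))))
    return gate * trilinear_score(a, b, c) + (1.0 - gate) * float(a @ b)

print(trilinear_score(a, b, c_good))  # nonzero: all three contribute
print(trilinear_score(a, b, c_bad))   # exactly 0.0: the collapse
print(gated_score(a, b, c_bad))       # falls back to a . b, signal kept
```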
- Fast Training of Mixture-of-Experts for Time Series Forecasting via Expert Loss Integration
  A new MoE training method integrates expert-level losses and partial online updates to improve forecasting accuracy and efficiency over standard statistical and neural models.
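A hedged sketch of one way to combine a gate-weighted mixture loss with expert-level losses and a partial online step; the loss forms and the "update only the routed expert" reading are assumptions, not the paper's algorithm:

```python
# Sketch, not the paper's algorithm: a gate-weighted mixture loss plus
# expert-level losses, and a "partial online" step that adapts only the
# routed expert.
import torch
import torch.nn.functional as F

d_in, d_out, n_experts = 8, 1, 4
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_in, d_out) for _ in range(n_experts)])
gate = torch.nn.Linear(d_in, n_experts)
opt = torch.optim.SGD(
    [*experts.parameters(), *gate.parameters()], lr=1e-2)

def training_step(x, y, aux_weight=0.5):
    w = torch.softmax(gate(x), dim=-1)                  # (B, K) routing
    outs = torch.stack([e(x) for e in experts], dim=1)  # (B, K, d_out)
    mix = (w.unsqueeze(-1) * outs).sum(dim=1)           # mixture forecast
    mixture_loss = F.mse_loss(mix, y)
    # Expert-level losses: each expert is scored on its own forecast,
    # weighted by how much the gate trusted it for this input.
    per_expert = ((outs - y.unsqueeze(1)) ** 2).mean(dim=-1)  # (B, K)
    expert_loss = (w.detach() * per_expert).mean()
    opt.zero_grad()
    (mixture_loss + aux_weight * expert_loss).backward()
    opt.step()

def partial_online_update(x, y, lr=1e-2):
    # One reading of "partial online updates": when a new point arrives,
    # adapt only the top-1 expert; the gate and the rest stay frozen.
    with torch.no_grad():
        k = int(torch.softmax(gate(x), dim=-1).argmax())
    expert = experts[k]
    loss = F.mse_loss(expert(x), y)
    grads = torch.autograd.grad(loss, list(expert.parameters()))
    with torch.no_grad():
        for p, g in zip(expert.parameters(), grads):
            p -= lr * g

training_step(torch.randn(16, d_in), torch.randn(16, d_out))
partial_online_update(torch.randn(1, d_in), torch.randn(1, d_out))
```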
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% of the compute by combining fine-grained expert segmentation with shared experts.
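A toy forward pass showing the two named ingredients, shared always-on experts plus many small routed experts with a larger top-k; all sizes, names, and the dense per-row loop are illustrative, not DeepSeekMoE's implementation:

```python
# Toy sketch of fine-grained expert segmentation plus shared experts;
# configuration here is assumed, not DeepSeekMoE's.
import torch

d_model, n_routed, n_shared, top_k = 32, 16, 2, 4
hidden = d_model // 2  # fine-grained: each expert is a narrow FFN

def make_expert():
    return torch.nn.Sequential(
        torch.nn.Linear(d_model, hidden), torch.nn.GELU(),
        torch.nn.Linear(hidden, d_model))

routed = torch.nn.ModuleList([make_expert() for _ in range(n_routed)])
shared = torch.nn.ModuleList([make_expert() for _ in range(n_shared)])
router = torch.nn.Linear(d_model, n_routed)

def moe_forward(x):                                # x: (B, d_model)
    # Shared experts run on every token, absorbing common knowledge so
    # the routed experts are free to specialize.
    out = sum(e(x) for e in shared)
    # Fine-grained segmentation: many small experts with a larger top-k
    # gives far more expert combinations than a few big experts.
    scores = torch.softmax(router(x), dim=-1)      # (B, n_routed)
    topv, topi = scores.topk(top_k, dim=-1)        # (B, top_k) each
    picked = []
    for b in range(x.shape[0]):
        row = x[b:b + 1]
        picked.append(sum(topv[b, j] * routed[int(topi[b, j])](row)[0]
                          for j in range(top_k)))
    return out + torch.stack(picked)

print(moe_forward(torch.randn(3, d_model)).shape)  # torch.Size([3, 32])
```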