Auxiliary-loss-free load balancing strategy for mixture-of-experts
18 Pith papers cite this work.
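The cited work's auxiliary-loss-free strategy balances expert load with a per-expert bias that shifts top-k selection without touching the gating weights or the training loss. The sketch below illustrates that idea; the sign-based update rule and the update rate `u` are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def topk_route(scores, bias, k):
    """Select each token's top-k experts by biased scores; the bias
    influences selection only, not the mixing weights."""
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, expert_load, u=0.001):
    """Loss-free update: push an expert's bias down when it is
    overloaded relative to the mean, up when underloaded."""
    return bias - u * np.sign(expert_load - expert_load.mean())

rng = np.random.default_rng(0)
n_tokens, n_experts, k = 64, 8, 2
scores = rng.normal(size=(n_tokens, n_experts))
bias = np.zeros(n_experts)

for _ in range(200):
    idx = topk_route(scores, bias, k)
    load = np.bincount(idx.ravel(), minlength=n_experts)
    bias = update_bias(bias, load)
```

Because the bias never enters the loss, balancing adds no gradient interference of the kind the auxiliary-loss approach introduces.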
Citing papers
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
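One way to probe the geometric coupling this paper describes is to compare each expert's router embedding with the dominant input direction of that expert's first-layer weights. The sketch below uses the top right singular vector as that direction; this particular choice of probe is an assumption for illustration, not necessarily the paper's measurement.

```python
import numpy as np

def router_expert_alignment(router_w, expert_in_w):
    """Cosine similarity between each expert's router embedding and the
    dominant input direction (top right singular vector) of that
    expert's first-layer weight matrix."""
    sims = []
    for w, W in zip(router_w, expert_in_w):
        _, _, vt = np.linalg.svd(W, full_matrices=False)
        v = vt[0]  # direction the expert is most sensitive to
        sims.append(abs(w @ v) / (np.linalg.norm(w) * np.linalg.norm(v)))
    return np.array(sims)

rng = np.random.default_rng(0)
d, hidden = 16, 32
router_w = rng.normal(size=(2, d))
# Experts constructed to be geometrically coupled to their router rows:
aligned = [np.outer(rng.normal(size=hidden), router_w[i])
           + 0.01 * rng.normal(size=(hidden, d)) for i in range(2)]
random_W = [rng.normal(size=(hidden, d)) for _ in range(2)]
sim_aligned = router_expert_alignment(router_w, aligned)
sim_random = router_expert_alignment(router_w, random_W)
```

A coupled router–expert pair scores near 1 under this probe, while an unrelated pair scores near the random-vector baseline.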
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP lets wide expert-parallel MoE serving survive single-rank failures: an 11 s recovery pause, an 8 s reintegration pause, throughput restored to 95% of the pre-fault level within 52 s, and steady-state performance within 4.4% of a fixed-membership baseline.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
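The duplication step of upcycling amounts to copying each expert's weights before continued pre-training. In the sketch below, a small noise perturbation breaks symmetry so the copies can diverge; that perturbation is an assumed detail, not necessarily the paper's recipe.

```python
import numpy as np

def upcycle_experts(expert_weights, factor=2, noise=1e-3, seed=0):
    """Duplicate each expert's weight matrix `factor` times, lightly
    perturbing the copies so they can specialize during further training."""
    rng = np.random.default_rng(seed)
    new = []
    for W in expert_weights:
        new.append(W)  # keep the original expert
        for _ in range(factor - 1):
            new.append(W + noise * rng.normal(size=W.shape))
    return new

rng = np.random.default_rng(1)
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
upcycled = upcycle_experts(experts, factor=2)
```

The checkpoint's routers then need widening to address the enlarged pool, a step omitted here.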
-
Conditional Memory Enhanced Item Representation for Generative Recommendation
ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense-model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, and delivers a 3x speedup on real hardware.
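ReLU-based routing makes sparsity data-dependent: experts whose gate value is positive are active, rather than a fixed top-k being forced. A minimal sketch, with the learnable scale as a plain per-expert parameter; NormSiLU is not defined in this summary, so it is omitted here.

```python
import numpy as np

def relu_route(x, w_gate, scale):
    """Gate = scale * relu(x @ W): experts with gate > 0 are active,
    so the activation ratio emerges from the data, not a fixed k."""
    gate = np.maximum(x @ w_gate, 0.0) * scale
    active = gate > 0
    return gate, active

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
w = rng.normal(size=(16, 8))
gate, active = relu_route(x, w, scale=np.ones(8))
ratio = active.mean()  # fraction of (token, expert) pairs activated
```

Pushing `ratio` toward a target such as 20% would be the job of training pressure on the gate weights and scales.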
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
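Two-level routing of the kind summarized above can be sketched as: a group router first picks a group, then an expert router picks within that group's slice of the expert pool. The balance and specialization objectives themselves are omitted; shapes and softmax gating are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hierarchical_route(x, w_group, w_expert, experts_per_group):
    """Choose a group per token, then an expert inside that group."""
    g = softmax(x @ w_group).argmax(axis=-1)          # (tokens,)
    e_scores = softmax(x @ w_expert)                  # (tokens, n_experts)
    chosen = np.empty(x.shape[0], dtype=int)
    for t in range(x.shape[0]):
        lo = g[t] * experts_per_group
        hi = lo + experts_per_group
        chosen[t] = lo + e_scores[t, lo:hi].argmax()  # restricted to group
    return g, chosen

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 8))
g, chosen = hierarchical_route(x, rng.normal(size=(8, 4)),
                               rng.normal(size=(8, 12)), experts_per_group=3)
```

Balancing at the group level while letting within-group scores specialize is what separates this from a flat router over all twelve experts.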
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
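A globally shared pool, as opposed to per-layer experts, means every layer routes into the same list of expert modules, so expert parameters need not grow linearly with depth. A minimal sketch with illustrative shapes:

```python
import numpy as np

class SharedExpertPool:
    """One global list of expert weight matrices, reused by every layer."""
    def __init__(self, n_experts, d, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [rng.normal(size=(d, d)) / np.sqrt(d)
                        for _ in range(n_experts)]

    def apply(self, x, idx):
        # Every layer calls into the same pool with its own routed index.
        return x @ self.experts[idx]

pool = SharedExpertPool(n_experts=4, d=8)
x = np.ones(8)
h1 = pool.apply(x, 0)   # "layer 1" routes to expert 0
h2 = pool.apply(h1, 0)  # a deeper layer can reuse the same expert
```

Adding depth only adds routers, not experts, which is what permits sublinear expert-parameter growth.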
-
SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning
SPHERE applies a Parseval penalty to MoE policies in continual RL to maintain spectral plasticity, yielding 133% and 50% higher average success on MetaWorld and HumanoidBench versus unregularized MoE baselines.
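A Parseval penalty in its standard form is ||W^T W - I||_F^2, which is zero exactly when W has orthonormal columns, i.e. all singular values equal 1; keeping singular values near 1 is what preserves spectral plasticity. A minimal sketch of the penalty alone (its coupling to the MoE policy is omitted):

```python
import numpy as np

def parseval_penalty(W):
    """||W^T W - I||_F^2: zero iff W has orthonormal columns,
    i.e. all singular values equal 1."""
    d = W.shape[1]
    G = W.T @ W - np.eye(d)
    return (G ** 2).sum()

# An orthonormal matrix incurs zero penalty; a collapsed one does not.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(8, 4)))
```

In training, this term is added to the RL loss so that gradient updates cannot silently collapse the policy's weight spectrum.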
-
E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
A dimensionless parameter E = T*H/(O+B) >= 0.5 is claimed to guarantee zero dead experts in Mixture-of-Experts models, eliminating the need for auxiliary load-balancing losses.
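Taken at face value, the criterion is a single ratio check. The summary does not spell out what T, H, O, and B denote, so the helper below simply treats them as the scalars in the formula.

```python
def moe_ecology_E(T, H, O, B):
    """E = T*H / (O + B); the paper's claim is that E >= 0.5
    guarantees zero dead experts."""
    return T * H / (O + B)

def predicts_no_dead_experts(T, H, O, B):
    return moe_ecology_E(T, H, O, B) >= 0.5
```

For example, T = H = O = B = 1 gives E = 0.5, exactly on the claimed boundary.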
-
Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA-guided gate over a simple MLP gate.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding, and grounding with 1.0B to 4.5B activated parameters.
-
EMO: Frustratingly Easy Progressive Training of Extendable MoE
EMO progressively expands the expert pool in MoE models using scaling-law-derived token budgets per stage, matching fixed-expert performance while cutting wall-clock time and GPU cost.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.