DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen · 2024 · cs.CL · arXiv 2401.06066

Canonical reference. 97 Pith papers cite this work, and 88% of classified citations use it as background.

abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
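The two strategies in the abstract can be illustrated with a small sketch. The function names below (`expert_combinations`, `select_routed_experts`) are hypothetical and do not come from the paper's code. The sketch also assumes that the total activation budget stays at $mK$, so the $K_s$ shared experts are always on and the router picks the remaining $mK - K_s$ from the routed pool; the abstract itself does not spell out this bookkeeping.

```python
import math


def expert_combinations(n_experts: int, k_active: int, m: int = 1) -> int:
    """Count the distinct sets of activated experts when each of the N experts
    is split into m finer ones and m*K of the resulting m*N are activated."""
    return math.comb(m * n_experts, m * k_active)


def select_routed_experts(affinities, total_active, k_shared):
    """Toy top-k selection over the routed experts only.

    `affinities` holds one router score per routed expert. Shared experts are
    assumed to be always active, so they are not scored here; they only reduce
    the routed budget (assumption: budget = total_active - k_shared).
    """
    routed_budget = total_active - k_shared
    if not 0 <= routed_budget <= len(affinities):
        raise ValueError("routed budget must fit within the routed expert pool")
    ranked = sorted(range(len(affinities)), key=lambda i: affinities[i], reverse=True)
    return sorted(ranked[:routed_budget])


# Example setting: N = 16 experts, K = 2 active.
conventional = expert_combinations(16, 2)        # 120
fine_grained = expert_combinations(16, 2, m=4)   # 4426165368

# Five routed experts, three active slots in total, one of them shared.
chosen = select_routed_experts([0.1, 0.9, 0.3, 0.7, 0.2], total_active=3, k_shared=1)
print(conventional, fine_grained, chosen)        # 120 4426165368 [1, 3]
```

With the per-token compute held roughly constant, splitting each expert four ways raises the number of possible expert combinations from 120 to over four billion, which is the "more flexible combination of activated experts" the abstract refers to.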

citation-role summary

background 21 · baseline 1 · dataset 1 · method 1

citation-polarity summary

background 21 · baseline 1 · use dataset 1 · use method 1

representative citing papers

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Towards One-to-Many Temporal Grounding

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Introduces the OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching a state-of-the-art 43.65% EtF1 on the new benchmark.

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

cs.IR · 2026-05-26 · unverdicted · novelty 7.0

L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Mixture of Layers with Hybrid Attention

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.

SDG-MoE: Signed Debate Graph Mixture-of-Experts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

MoE experts in pretrained Transformers exhibit functional decorrelation with near-zero Jacobian alignment yet occupy partially overlapping representation subspaces, with routing sparsity modulating the geometry.

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

OneTrackerV2 unifies multimodal tracking via Meta Merger and Dual Mixture-of-Experts to reach state-of-the-art results on five tasks and 12 benchmarks with efficiency and robustness when modalities are missing.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

Path-Constrained Mixture-of-Experts

cs.LG · 2026-03-18 · unverdicted · novelty 7.0

PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

cs.LG · 2026-02-13 · unverdicted · novelty 7.0

Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plus built-in robustness and per-sample contribution scores.

From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums

cs.AI · 2026-02-04 · unverdicted · novelty 7.0

A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.

S$^2$GR: Stepwise Semantic-Guided Reasoning in Latent Space for Generative Recommendation

cs.IR · 2026-01-26 · unverdicted · novelty 7.0

S²GR adds stepwise thinking tokens with contrastive supervision on codebook clusters to balance computational focus and ground reasoning paths in generative recommendation.

DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance

cs.DC · 2025-09-09 · unverdicted · novelty 7.0

DuoServe-MoE decouples prefill and decode phases in MoE LLM inference with a two-stream CUDA pipeline for prefill and an offline-trained predictor for decode, reporting up to 5.34x TTFT and 7.55x end-to-end latency gains.

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

cs.LG · 2024-08-28 · conditional · novelty 7.0

Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

DLLG: Dynamic Logit-Level Gating of LLM Experts

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

DLLG learns token-level fusion weights for LLM experts from sparse response supervision and outperforms routing, ensembling, and merging baselines on reasoning and code tasks.

Expert-Aware Refusal Steering

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.

State Machine Guided Multi-Relational Synthetic Data from Logs for Anomaly Detection

cs.MA · 2026-05-30 · unverdicted · novelty 6.0

A framework extracts a latent state machine from logs, induces a multi-table relational schema, and uses it as a generative prior to create synthetic data that augments real logs for better anomaly detection.

MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

MoG uses hub graphs for shared context and sparsely activates expert graphs with a topology-aware router, reporting over 20% relative gains on MuSiQue.

citing papers explorer

OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

cs.IR · 2025-02-26 · unverdicted · none · ref 7 · internal anchor

OneRec unifies retrieval and ranking in a generative recommender using session-wise decoding and iterative DPO-based preference alignment, achieving real-world gains on Kuaishou.
