DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen · 2024 · cs.CL · arXiv 2401.06066

42 Pith papers cite this work. Polarity classification is still indexing.

abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
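
The two strategies in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`num_expert_combinations`, `route_token`), the softmax gate, and the choice to count the $K_s$ shared experts against the $mK$ activation budget are assumptions made for illustration. The first function counts how many distinct sets of activated experts exist before and after fine-grained segmentation; the second picks the routed experts for one token while the shared experts stay always on.

```python
import math


def num_expert_combinations(n_experts, top_k, m=1):
    """Count distinct activated-expert sets when each of n_experts is
    split into m finer experts and m * top_k of them are activated."""
    return math.comb(m * n_experts, m * top_k)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(scores, num_shared, num_active):
    """Select experts for a single token (hypothetical helper).

    scores     -- affinity logits for the routed experts only
    num_shared -- K_s shared experts, always activated (ids 0..K_s-1)
    num_active -- total experts activated per token (mK); ASSUMPTION:
                  shared experts count against this budget
    Returns (shared_ids, gates), where gates maps routed expert id to
    its gate value (the softmax probability, left unnormalized).
    """
    k_routed = num_active - num_shared
    if not 0 <= k_routed <= len(scores):
        raise ValueError("num_active must lie between num_shared and "
                         "num_shared + number of routed experts")
    probs = softmax(scores)
    # Indices of the k_routed highest-affinity routed experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k_routed]
    shared_ids = list(range(num_shared))
    # Routed expert ids are offset so they follow the shared ids.
    gates = {num_shared + i: probs[i] for i in top}
    return shared_ids, gates


if __name__ == "__main__":
    # Conventional top-2 of 16 experts vs. the same budget split 4 ways.
    print(num_expert_combinations(16, 2))       # coarse experts
    print(num_expert_combinations(16, 2, m=4))  # fine-grained experts
    print(route_token([0.1, 2.0, -1.0, 1.5, 0.0, 0.3], num_shared=1, num_active=3))
```

With $N = 16$, $K = 2$ and $m = 4$, the number of possible activated sets grows from 120 to 4,426,165,368 at the same activated-parameter budget, which is the sense in which finer segmentation allows "a more flexible combination of activated experts".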

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Mixture of Layers with Hybrid Attention

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.

SDG-MoE: Signed Debate Graph Mixture-of-Experts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

OneTrackerV2 unifies multimodal tracking via Meta Merger and Dual Mixture-of-Experts to reach state-of-the-art results on five tasks and 12 benchmarks with efficiency and robustness when modalities are missing.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Combining pre-trained models via localized model averaging

stat.ME · 2026-05-13 · unverdicted · novelty 6.0

Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS methods on ImageNet and transfer datasets.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

cs.AR · 2026-05-07 · unverdicted · novelty 6.0

MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

cs.DC · 2026-05-06 · unverdicted · novelty 6.0

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

Rethinking LLM Ensembling from the Perspective of Mixture Models

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Geometric Routing Enables Causal Expert Control in Mixture of Experts

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

citing papers explorer

Showing 42 of 42 citing papers.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · none · ref 11 · internal anchor

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · none · ref 11 · internal anchor

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · none · ref 1 · internal anchor

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Mixture of Layers with Hybrid Attention

cs.LG · 2026-05-10 · unverdicted · none · ref 2 · internal anchor

Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.

SDG-MoE: Signed Debate Graph Mixture-of-Experts

cs.LG · 2026-05-08 · unverdicted · none · ref 11 · 2 links · internal anchor

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · none · ref 6 · internal anchor

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

cs.CV · 2026-05-05 · unverdicted · none · ref 2 · internal anchor

OneTrackerV2 unifies multimodal tracking via Meta Merger and Dual Mixture-of-Experts to reach state-of-the-art results on five tasks and 12 benchmarks with efficiency and robustness when modalities are missing.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · none · ref 13 · internal anchor

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · none · ref 12 · internal anchor

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · none · ref 11 · internal anchor

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Combining pre-trained models via localized model averaging

stat.ME · 2026-05-13 · unverdicted · none · ref 173 · internal anchor

Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

cs.DC · 2026-05-11 · unverdicted · none · ref 63 · internal anchor

Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

cs.LG · 2026-05-10 · unverdicted · none · ref 4 · internal anchor

DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · none · ref 3 · internal anchor

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

cs.CV · 2026-05-08 · unverdicted · none · ref 8 · internal anchor

TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS methods on ImageNet and transfer datasets.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · none · ref 7 · internal anchor

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

cs.AR · 2026-05-07 · unverdicted · none · ref 10 · internal anchor

MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

cs.DC · 2026-05-06 · unverdicted · none · ref 17 · internal anchor

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

Rethinking LLM Ensembling from the Perspective of Mixture Models

cs.LG · 2026-05-01 · unverdicted · none · ref 5 · internal anchor

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

cs.LG · 2026-04-27 · unverdicted · none · ref 147 · internal anchor

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

cs.CV · 2026-04-27 · unverdicted · none · ref 11 · internal anchor

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

cs.CL · 2026-04-23 · unverdicted · none · ref 15 · internal anchor

X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

cs.LG · 2026-04-20 · unverdicted · none · ref 6 · internal anchor

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Geometric Routing Enables Causal Expert Control in Mixture of Experts

cs.AI · 2026-04-15 · unverdicted · none · ref 3 · internal anchor

Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

cs.AI · 2026-04-10 · unverdicted · none · ref 4 · internal anchor

Expert specialization in MoEs is an emergent effect of hidden state geometry due to linear routers, not domain expertise, as confirmed empirically across models and explained by a proof on load-balancing effects.

MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

cs.LG · 2026-04-10 · unverdicted · none · ref 156 · internal anchor

MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.

Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism

cs.LG · 2026-04-03 · unverdicted · none · ref 8 · internal anchor

A novel adaptive MoE-based semantic communication system jointly routes experts using real-time CSI and semantic image content for improved MIMO wireless image transmission.

MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification

cs.LG · 2026-05-12 · unverdicted · none · ref 46 · internal anchor

MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

cs.LG · 2026-05-07 · unverdicted · none · ref 17 · internal anchor

A dimensionless parameter E = T*H/(O+B) >= 0.5 is claimed to guarantee zero dead experts in Mixture-of-Experts models, eliminating the need for auxiliary load-balancing losses.

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

cs.RO · 2026-05-07 · unverdicted · none · ref 14 · internal anchor

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

cs.DC · 2026-05-07 · unverdicted · none · ref 6 · internal anchor

Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

cs.CL · 2026-05-04 · unverdicted · none · ref 49 · internal anchor

ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

cs.DC · 2026-04-29 · unverdicted · none · ref 4 · internal anchor

FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.

SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

cs.CV · 2026-04-20 · unverdicted · none · ref 7 · internal anchor

SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.

Domain-Specialized Object Detection via Model-Level Mixtures of Experts

cs.CV · 2026-04-20 · unverdicted · none · ref 3 · internal anchor

Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.

HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation

cs.CV · 2026-04-08 · unverdicted · none · ref 7 · internal anchor

HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.

Does a Global Perspective Help Prune Sparse MoEs Elegantly?

cs.CL · 2026-04-08 · unverdicted · none · ref 8 · internal anchor

GRAPE is a global redundancy-aware pruning strategy for sparse MoEs that dynamically allocates pruning budgets across layers and improves average accuracy by 1.40% over the best local baseline across tested models and settings.

Qwen3 Technical Report

cs.CL · 2025-05-14 · unverdicted · none · ref 9 · internal anchor

Pith review generated a malformed one-line summary.

OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

cs.IR · 2025-02-26 · unverdicted · none · ref 7 · internal anchor

OneRec unifies retrieval and ranking in a generative recommender using session-wise decoding and iterative DPO-based preference alignment, achieving real-world gains on Kuaishou.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

cs.CV · 2024-12-13 · accept · none · ref 21 · internal anchor

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

cs.CV · 2026-05-04 · unverdicted · none · ref 11 · internal anchor

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation

cs.CV · 2026-04-28 · unverdicted · none · ref 13 · internal anchor

CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
