HeterMoE: Efficient training of mixture-of-experts models on heterogeneous GPUs

2025 · arXiv 2504.03871

4 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

cs.DC · 2025-12-13 · unverdicted · novelty 7.0

HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

DODOCO measurements show that MoE routing imbalance is intrinsic to the architecture and to real text: it is not corrected by EP scaling, is not represented by mock tokens, and forms two persistent Gini bands.

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

cs.DC · 2026-04-21 · unverdicted · novelty 5.0

UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

citing papers explorer

Showing 4 of 4 citing papers.

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments · cs.DC · 2025-12-13 · unverdicted · none · ref 44
Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory · cs.DC · 2026-05-20 · unverdicted · none · ref 7
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism · cs.LG · 2026-05-10 · unverdicted · none · ref 30
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training · cs.DC · 2026-04-21 · unverdicted · none · ref 43
