2 Pith papers cite this work.

Representative citing papers:
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  DeepSeekMoE 2B matches GShard 2.9B in performance and approaches a dense model with the same total parameter count; the 16B version matches LLaMA2-7B using roughly 40% of its compute. The key ideas are fine-grained expert segmentation plus always-active shared experts (sketched in code after this list).
- Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
  Heterogeneous SYCL-based CG and Cholesky solvers run up to 32% and 29% faster, respectively, than GPU-only versions on large matrices across multiple GPU vendors (the CPU+GPU work-splitting idea is sketched after this list).
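
For the DeepSeekMoE entry, here is a minimal PyTorch sketch of the two architectural ideas the summary names: fine-grained expert segmentation (many narrow routed experts, several activated per token) and shared experts that every token always passes through. All sizes (`d_model`, `d_hidden`, `n_routed`, `top_k`, ...) are hypothetical placeholders rather than the paper's configuration, and the gate omits the paper's load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """One narrow FFN expert: d_model -> d_hidden -> d_model."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

class MoESketch(nn.Module):
    """Fine-grained routed experts plus always-active shared experts."""
    def __init__(self, d_model=256, d_hidden=128, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        # Fine-grained segmentation: keep each routed expert narrow (small
        # d_hidden) and compensate by routing each token to more of them
        # (larger top_k), which allows more flexible expert combinations.
        self.routed = nn.ModuleList([FeedForward(d_model, d_hidden) for _ in range(n_routed)])
        self.shared = nn.ModuleList([FeedForward(d_model, d_hidden) for _ in range(n_shared)])
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)   # token-to-expert affinities
        weight, idx = scores.topk(self.top_k, dim=-1)
        out = sum(e(x) for e in self.shared)       # shared experts: no routing
        for slot in range(self.top_k):             # routed experts: sparse sum
            for e_id, expert in enumerate(self.routed):
                hit = idx[:, slot] == e_id         # tokens whose slot picked e_id
                if hit.any():
                    out[hit] = out[hit] + weight[hit, slot, None] * expert(x[hit])
        return out

y = MoESketch()(torch.randn(8, 256))               # usage: 8 tokens in, 8 out
print(y.shape)                                     # torch.Size([8, 256])
```

The per-expert loop is written for readability; production MoE layers batch tokens per expert instead of masking, but the routing math is the same.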
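
For the SYCL entry, the paper's solvers are written in SYCL/C++; the sketch below only illustrates the heterogeneous idea in PyTorch terms. The dominant matrix-vector product inside conjugate gradient is split by rows between a GPU slice and a CPU slice each iteration, instead of running GPU-only. The dense SPD test system, the fixed `split` point, and all function names are illustrative assumptions, not the paper's code.

```python
import torch

def hetero_matvec(a_gpu, a_cpu, x):
    # Each device computes its row slice of A @ x; results are concatenated.
    top = (a_gpu @ x.to(a_gpu.device)).cpu()
    bottom = a_cpu @ x
    return torch.cat([top, bottom])

def hetero_cg(A, b, split, tol=1e-8, max_iter=500):
    dev = "cuda" if torch.cuda.is_available() else "cpu"  # graceful fallback
    a_gpu, a_cpu = A[:split].to(dev), A[split:]           # static row split
    x = torch.zeros_like(b)
    r = b - hetero_matvec(a_gpu, a_cpu, x)                # initial residual
    p = r.clone()
    rs = r @ r
    for _ in range(max_iter):
        Ap = hetero_matvec(a_gpu, a_cpu, p)               # the heterogeneous step
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage: build a random symmetric positive-definite system and solve it.
n = 512
M = torch.randn(n, n, dtype=torch.float64)
A = M @ M.T + n * torch.eye(n, dtype=torch.float64)      # SPD by construction
b = torch.randn(n, dtype=torch.float64)
x = hetero_cg(A, b, split=n // 2)
print((A @ x - b).norm())                                # residual near tol
```

In the paper the split would be tuned so each device finishes its slice at about the same time; a 50/50 split is used here purely for illustration.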