Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Upcycling large language models into mixture of experts
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
A literature survey that categorizes how Mixture-of-Experts architectures address multimodal learning challenges and identifies open research gaps.
Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.
citing papers explorer
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
SpikingBrain: Spiking Brain-inspired Large Models
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
-
Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey
A literature survey that categorizes how Mixture-of-Experts architectures address multimodal learning challenges and identifies open research gaps.
-
Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts
Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.