GroveMoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts. arXiv preprint arXiv:2508.07785

· 2025 · arXiv 2508.07785

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

cs.LG · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Post-Trained MoE Can Skip Half Experts via Self-Distillation

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts

cs.LG · 2025-10-09 · unverdicted · novelty 5.0

Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

cs.LG · 2026-06-05 · unverdicted · novelty 4.0

A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts · cs.LG · 2026-04-21 · unverdicted · none · ref 55 · 2 links
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Post-Trained MoE Can Skip Half Experts via Self-Distillation · cs.LG · 2026-05-18 · unverdicted · none · ref 5 · 2 links
ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.
Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts · cs.LG · 2025-10-09 · unverdicted · none · ref 25
Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.
Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling · cs.LG · 2026-06-05 · unverdicted · none · ref 67
A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.
