pith. sign in

arxiv: 2603.06626 · v2 · pith:5CNFGOSXnew · submitted 2026-02-22 · 💻 cs.LG · cs.AI

Grouter: Decoupling Routing from Representation for Accelerated MoE Training

classification 💻 cs.LG cs.AI
keywords grouterroutingtrainingexpertmodelpreemptivestructuralachieves
0
0 comments X
read the original abstract

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that by distilling high-quality structures from fully-trained MoE models and serving as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly accelerates both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency which boosts pre-training data utilization by 4.28x and achieves up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training. We publicly release our code and pretrained Grouter checkpoints at https://github.com/JimmyAwoe/Grouter.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  2. Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

    cs.AI 2026-04 conditional novelty 7.0

    Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.

  3. Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    EDAS modulates advantage signals in RLVR to penalize repeated errors more and rare errors less, yielding consistent gains on math benchmarks when added to existing methods.

  4. Step-wise Rubric Rewards for LLM Reasoning

    cs.LG 2026-05 conditional novelty 6.0

    SRaR attributes rubric items to specific steps via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards via a decoupled advantage estimator, yielding 3.57-point accuracy gai...