Grouter: Decoupling routing from representation for accelerated MoE training

Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan · 2026 · cs.LG · arXiv 2603.06626

4 Pith papers cite this work. Polarity classification is still indexing.



abstract

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to train expert weights while simultaneously searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully trained MoE models and serves as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter substantially improves both the speed and the quality of model convergence. To ensure the framework's versatility, we also introduce expert folding, which adapts Grouter across varying model configurations, and expert tuning, which rebalances workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency, boosting pre-training data utilization by 4.28x and achieving up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training. We publicly release our code and pretrained Grouter checkpoints at https://github.com/JimmyAwoe/Grouter.
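
The idea of training expert weights against a fixed router can be illustrated with a toy sketch. This is not the authors' implementation and does not cover distillation, expert folding, or expert tuning. It assumes a hypothetical router matrix (random values standing in for a distilled router), experts reduced to one trainable scalar each, and plain SGD on a squared-error loss. All function and variable names are illustrative.

```python
import math
import random

def top_k_route(token, router_weights, k):
    """Return (expert_index, gate) pairs for the k highest-scoring experts."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gates sum to 1.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(token, router_weights, expert_scales, k):
    """Toy MoE layer: expert i multiplies the token by the scalar expert_scales[i]."""
    routes = top_k_route(token, router_weights, k)
    out = [0.0] * len(token)
    for i, gate in routes:
        for d, x in enumerate(token):
            out[d] += gate * expert_scales[i] * x
    return out, routes

def train_step(token, target, router_weights, expert_scales, k, lr=0.1):
    """One SGD step on squared error. Only expert_scales changes; the router is frozen."""
    out, routes = moe_forward(token, router_weights, expert_scales, k)
    err = [o - t for o, t in zip(out, target)]
    for i, gate in routes:
        # d(loss)/d(scale_i) = sum over dimensions of 2 * err_d * gate_i * x_d
        grad = sum(2.0 * e * gate * x for e, x in zip(err, token))
        expert_scales[i] -= lr * grad
    return sum(e * e for e in err)

random.seed(0)
dim, n_experts, k = 4, 8, 2
# Hypothetical stand-in for a router obtained elsewhere; it is never updated below.
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
router_before = [row[:] for row in router]
scales = [1.0] * n_experts

token = [0.5, -1.0, 0.25, 2.0]
target = [2.0 * x for x in token]
losses = [train_step(token, target, router, scales, k) for _ in range(50)]
routes = top_k_route(token, router, k)
```

Because the router never changes, the token is sent to the same `k` experts on every step, so only those experts' parameters move and the expert assignment is known before training starts.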

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

cs.AI · 2026-04-15 · conditional · novelty 7.0

Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.

Step-wise Rubric Rewards for LLM Reasoning

cs.LG · 2026-05-17 · conditional · novelty 6.0

SRaR attributes rubric items to specific steps via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards via a decoupled advantage estimator, yielding 3.57-point accuracy gains on Qwen3-8B across math benchmarks.

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

cs.LG · 2026-05-17 · unverdicted · novelty 5.0 · 2 refs

EDAS modulates RL advantage signals for incorrect rollouts by amplifying penalties on repeated errors and attenuating them on rare ones, yielding average gains of 6.29 points over DAPO on Qwen3-8B across seven math benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts cs.LG · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
