pith. machine review for the scientific record.

arxiv: 2605.07260 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CL

Recognition: no theorem link

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

Jungseul Ok, Siwei Wang, Wei Chen, Youngsik Yoon

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:41 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords mixture of experts · routing analysis · counterfactual evaluation · language models · reasoning tokens · model optimization · expert allocation

The pith

In mixture-of-experts models the standard router selects worse routes than available alternatives precisely on the uncertain tokens that drive hard reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that holding experts fixed and scoring each token's standard route against sampled equal-compute alternatives by next-token probability on verified trajectories reveals a sharp split. On confident tokens the router already picks near-optimal routes, but on fragile tokens better lower-loss routes consistently exist inside the frozen model yet go unselected. This pattern follows directly from training that scores only the executed route and balances load only in aggregate. A minimal update to the final-layer router alone, with every expert and prior router left untouched, raises pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B.

Core claim

Holding the model fixed, the standard top-k router assigns routes that match the utility of sampled equal-compute alternatives on confident tokens but fail to do so on fragile tokens, where alternatives with higher next-token probability for the realized token exist but are not selected; this pattern is structural to the training objective and a minimal router-only update suffices to improve downstream performance.
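For context, the standard top-k selection step under scrutiny can be sketched as follows. This is a minimal illustration, not the paper's code; the names and the softmax renormalization over the selected experts are our assumptions about a typical top-k router.

```python
# Minimal sketch of standard top-k MoE routing: pick the k highest-scoring
# experts for one token and renormalize their gate weights. Illustrative only.
import numpy as np

def top_k_route(gate_logits: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Return the indices of the k largest router logits (the executed route)
    and their softmax-renormalized gate weights."""
    idx = np.argsort(gate_logits)[::-1][:k]               # executed route
    w = np.exp(gate_logits[idx] - gate_logits[idx].max()) # stable softmax
    return idx, w / w.sum()

experts, weights = top_k_route(np.array([2.0, 0.5, 1.5, -1.0]), k=2)
# experts holds the indices of the two largest logits; weights sum to 1
```

The point the paper stresses is that only this one selected route is ever executed and scored by the language-modeling loss; no gradient signal compares it against the routes not taken.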

What carries the argument

Counterfactual comparison of the standard route against sampled equal-compute alternatives, scored by the next-token probability assigned to the realized token in a verified reasoning trajectory.
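The comparison above can be sketched in code. Everything here is an assumed interface: `forward_with_route` stands in for a hypothetical call that runs the frozen model with a forced expert route at the layer under study and returns next-token probabilities; the sampling of equal-compute (same-k) alternatives follows the paper's description, not its implementation.

```python
# Sketch of the counterfactual routing comparison, assuming a hypothetical
# forward_with_route(tokens, route) that runs the frozen model with a forced
# route and returns a next-token probability distribution.
import random

def score_route(forward_with_route, tokens, target_id, route):
    """Probability the frozen model assigns to the realized token
    when the given route is forced."""
    probs = forward_with_route(tokens, route)
    return probs[target_id]

def counterfactual_gap(forward_with_route, tokens, target_id,
                       std_route, all_experts, k, n_samples=16):
    """Score the standard route against sampled equal-compute (same-k)
    alternatives; a positive gap means a better route exists unselected."""
    p_std = score_route(forward_with_route, tokens, target_id, std_route)
    p_best = p_std
    for _ in range(n_samples):
        alt = tuple(sorted(random.sample(all_experts, k)))  # same compute: k experts
        p_best = max(p_best, score_route(forward_with_route, tokens, target_id, alt))
    return p_best - p_std
```

On the paper's account, this gap is near zero on confident tokens and consistently positive on fragile ones.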

If this is right

  • The router aligns with route utility on confident tokens.
  • Lower-loss equal-compute routes exist for fragile tokens but remain unselected.
  • A router-only update to the final layer alone improves pass@K on AIME 2024+2025 and HMMT 2025.
  • The misalignment appears across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B.
  • Training scores only the executed route and aggregate load statistics, leaving individual route quality unoptimized.
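The last bullet is the structural point: the auxiliary load-balancing term common in top-k training depends only on aggregate statistics. A Switch-Transformer-style loss (our choice of example, not necessarily the one each cited model uses) makes this concrete: it sees only the fraction of tokens routed to each expert and the mean gate probability, never the counterfactual quality of any individual route.

```python
# Switch-style auxiliary load-balancing loss: alpha * N * sum_i f_i * P_i,
# where f_i is the fraction of tokens assigned to expert i and P_i the mean
# router probability for expert i. Purely aggregate; no per-route signal.
import numpy as np

def load_balance_loss(gate_probs: np.ndarray, expert_assign: np.ndarray,
                      n_experts: int, alpha: float = 0.01) -> float:
    f = np.bincount(expert_assign, minlength=n_experts) / len(expert_assign)
    P = gate_probs.mean(axis=0)  # mean router probability per expert
    return alpha * n_experts * float(np.dot(f, P))
```

A perfectly balanced batch gives the loss its minimum value of `alpha`, regardless of whether any individual token was routed well.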

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Some performance limits attributed to expert capacity may instead be reachable by better route selection within the existing model.
  • Training objectives could incorporate direct signals about route quality on uncertain tokens rather than relying on indirect load balancing.
  • The same counterfactual method could be applied to identify misrouting in other conditional-computation or sparse architectures.

Load-bearing premise

That the sampled equal-compute alternative routes represent the space of possible routes and that next-token probability on a verified trajectory serves as a sufficient proxy for overall route quality on downstream tasks.

What would settle it

Observing no increase in pass@K on AIME 2024+2025 and HMMT 2025 after the minimal final-layer router update, or finding that sampled alternatives never show lower loss than the standard route on fragile tokens.

Figures

Figures reproduced from arXiv: 2605.07260 by Jungseul Ok, Siwei Wang, Wei Chen, Youngsik Yoon.

Figure 1
Figure 1: Counterfactual routing statistics by confidence bin. Tokens are grouped by p̄, the mean realized-token probability over sampled equal-compute alternative routes: Confident (p̄ > 0.9), Ambiguous (0.5 < p̄ ≤ 0.9), and Fragile (p̄ ≤ 0.5). Top-1 reports how often the standard top-k route is best among evaluated routes. p_std and p_best denote realized-token probabilities under the standard and best evaluated rou… view at source ↗
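The binning in the caption above can be written directly, taking p̄ as the mean realized-token probability over sampled alternative routes:

```python
# The three confidence bins from Figure 1's caption, with p_bar the mean
# realized-token probability over sampled equal-compute alternative routes.
def confidence_bin(p_bar: float) -> str:
    if p_bar > 0.9:
        return "Confident"
    elif p_bar > 0.5:
        return "Ambiguous"
    else:
        return "Fragile"
```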
Figure 2
Figure 2: Routing quality degrades sharply on low-confidence tokens. (a) Standard route probability vs. best route probability across token difficulty, showing the gap as tokens become harder. (b) Standard route performance measured by Top-K rates. Based on this successful path, we report the standard-route probability p_{t,ℓ}(S^std_{t,ℓ}), the best-route probability max_S p_{t,ℓ}(S), their gap, and the Top-K rates for K ∈ {1,… view at source ↗
Figure 3
Figure 3: Pass@K curves on AIME 2024+2025 and HMMT 2025. For each problem, we pool n = 160 completions and compute pass@K with the standard combinatorial estimator q̂_p(K) = 1 − C(n − c_p, K) / C(n, K), where c_p is the number of correct completions. Scores are averaged over problems. Shaded bands show 95% bootstrap confidence intervals; see Appendix C for details. view at source ↗
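The estimator in Figure 3's caption is the standard unbiased pass@K formula and is short enough to state in code:

```python
# Unbiased combinatorial pass@K: q(K) = 1 - C(n - c, K) / C(n, K),
# the probability that at least one of K completions drawn without
# replacement from n pooled completions (c correct) is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer incorrect completions than draws: always pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```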
read the original abstract

Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard top-k routers in MoE language models are well-aligned with route utility on confident tokens but uninformative on fragile tokens that drive hard reasoning, where lower-loss equal-compute alternative routes exist inside the frozen model. This pattern is shown via counterfactual scoring of routes by next-token probability on verified trajectories, holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, follows from the training objective, and is addressed by a minimal final-layer router update that improves pass@K on AIME 2024+2025 and HMMT 2025.

Significance. If the central empirical pattern holds, the work isolates routing decisions as a separable and addressable source of suboptimality in MoE models, distinct from expert capacity limits. The token-conditional analysis and the demonstration that router-only updates (leaving experts and other routers frozen) can shift downstream math benchmark performance provide a concrete, low-cost intervention path. Cross-model replication and the structural explanation tied to the LM loss and load-balancing objective add generality.

major comments (3)
  1. [§4] §4 (Counterfactual Analysis): The scoring of alternative routes solely by next-token probability assigned to the realized token from the original trajectory assumes invariance of the target under routing changes. For fragile tokens this assumption is load-bearing, because an alternative route alters the hidden state and subsequent distribution, so the original token may no longer be the relevant target; the paper does not provide a direct test of this invariance.
  2. [§5] §5 (Router Update Experiment): While the final-layer router fine-tuning improves pass@K, the experiment does not retroactively validate the per-token proxy used to label tokens as fragile or to assert existence of superior routes inside the frozen model; the benchmark gains could arise from a different mechanism than the one identified in the counterfactual analysis.
  3. [Methods] Methods section: The sampling procedure for equal-compute alternatives, the definition of 'verified reasoning trajectory,' the number of alternatives per token, and controls for multiple random seeds or statistical significance of the misalignment metric are not specified with sufficient detail to allow independent verification of the token-conditional claims.
minor comments (2)
  1. [Abstract] Abstract: 'Fragile tokens' and 'confident tokens' are introduced without a precise operational definition (e.g., threshold on loss or entropy); a short formal definition would improve readability.
  2. [Figures/Tables] Figure captions and tables reporting per-token statistics should include the exact number of tokens and models over which the 'uninformative' claim is aggregated.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback, which identifies key areas for clarification and strengthening. We agree that the Methods section requires expansion for reproducibility and that the links between the counterfactual proxy and the router update results need to be made more explicit. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Counterfactual Analysis): The scoring of alternative routes solely by next-token probability assigned to the realized token from the original trajectory assumes invariance of the target under routing changes. For fragile tokens this assumption is load-bearing, because an alternative route alters the hidden state and subsequent distribution, so the original token may no longer be the relevant target; the paper does not provide a direct test of this invariance.

    Authors: We acknowledge that scoring alternatives by the probability assigned to the original realized token assumes the target remains relevant despite the altered hidden state. This is a simplifying assumption in the proxy, and for fragile tokens it is particularly important since the continuation could shift. The current manuscript does not include a direct test, such as executing alternative routes and measuring changes in the subsequent target distribution. We will revise §4 to explicitly state this limitation of the local proxy metric, discuss its implications for interpretation, and note that the structural argument from the LM loss and load-balancing objective, together with cross-model replication, provides complementary support for the observed pattern. revision: partial

  2. Referee: [§5] §5 (Router Update Experiment): While the final-layer router fine-tuning improves pass@K, the experiment does not retroactively validate the per-token proxy used to label tokens as fragile or to assert existence of superior routes inside the frozen model; the benchmark gains could arise from a different mechanism than the one identified in the counterfactual analysis.

    Authors: We agree that the router-update results demonstrate a performance benefit from final-layer router adjustment but do not directly validate that the gains arise specifically from correcting the fragile-token misroutes identified by the counterfactual proxy; other mechanisms are possible. In the revision we will add analysis that examines how the fine-tuned router alters routing decisions on the tokens previously labeled fragile, and we will correlate those changes with the original proxy scores. We will also include a discussion of alternative explanations for the pass@K improvements to clarify the connection between the two parts of the work. revision: partial

  3. Referee: Methods section: The sampling procedure for equal-compute alternatives, the definition of 'verified reasoning trajectory,' the number of alternatives per token, and controls for multiple random seeds or statistical significance of the misalignment metric are not specified with sufficient detail to allow independent verification of the token-conditional claims.

    Authors: We thank the referee for highlighting the insufficient detail. In the revised manuscript we will expand the Methods section to specify: the exact sampling procedure used to generate equal-compute alternative routes, the definition and verification criteria for reasoning trajectories, the number of alternatives evaluated per token, the random seeds employed, and the statistical tests (including significance thresholds) applied to the misalignment metrics. These additions will enable independent replication of the token-conditional findings. revision: yes

standing simulated objections not resolved
  • A direct empirical test of the invariance of the target token under alternative routing for fragile tokens.

Circularity Check

0 steps flagged

No significant circularity; empirical analysis is self-contained

full rationale

The paper performs direct empirical comparisons inside frozen models: for each token it samples equal-compute alternative routes and scores them by next-token probability on the realized token from a verified trajectory. No equations or derivations reduce by construction to fitted parameters, self-definitions, or self-citations. The structural observation about top-k training (loss only on executed route, load balancing on aggregates) is a general property of the objective, not a self-referential prediction. The router-only update experiment measures downstream pass@K directly and does not rely on the per-token proxy for its validity. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The analysis relies on standard assumptions of MoE architectures and language modeling objectives but introduces no new free parameters, axioms, or invented entities beyond those already present in the cited models.

pith-pipeline@v0.9.0 · 5572 in / 1173 out tokens · 28540 ms · 2026-05-11T01:41:50.259352+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 16 internal anchors
