arXiv preprint arXiv:2507.17702

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models · 2025 · arXiv 2507.17702

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

cs.LG · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

cs.AI · 2026-04-15 · conditional · novelty 7.0

Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.

Optimus: Elastic Decoding for Efficient Diffusion LLM Serving

cs.DC · 2026-05-24 · unverdicted · novelty 6.0

Optimus enables elastic decoding granularity adaptation in diffusion LLMs via chunked decoding and load-based scheduling to raise throughput under dynamic conditions.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Geometric Routing Enables Causal Expert Control in Mixture of Experts

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

Generalization and Scaling Laws for Mixture-of-Experts Transformers

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

A covering-number bound and ERM analysis for MoE Transformers under manifold data produces generalization and scaling laws that treat active capacity like dense networks while adding routing overhead.

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

cs.AI · 2026-05-31 · unverdicted · novelty 5.0

DAG-MoE uses a lightweight module to learn DAG-based structural aggregation of selected experts, expanding combination space and enabling intra-layer multi-step reasoning compared to standard weighted-sum MoE.

ZONOS2 Technical Report

cs.SD · 2026-06-23 · unverdicted · novelty 4.0

ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.

citing papers explorer

Showing 10 of 10 citing papers.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs · cs.LG · 2026-05-01 · unverdicted · none · ref 62
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts · cs.LG · 2026-04-21 · unverdicted · none · ref 51 · 2 links
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality · cs.AI · 2026-04-15 · conditional · none · ref 22
Optimus: Elastic Decoding for Efficient Diffusion LLM Serving · cs.DC · 2026-05-24 · unverdicted · none · ref 42
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices · cs.LG · 2026-05-11 · unverdicted · none · ref 176 · 3 links
ZAYA1-8B Technical Report · cs.AI · 2026-05-06 · unverdicted · none · ref 191
Geometric Routing Enables Causal Expert Control in Mixture of Experts · cs.AI · 2026-04-15 · unverdicted · none · ref 11
Generalization and Scaling Laws for Mixture-of-Experts Transformers · cs.LG · 2026-04-10 · unverdicted · none · ref 3
DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts · cs.AI · 2026-05-31 · unverdicted · none · ref 26
ZONOS2 Technical Report · cs.SD · 2026-06-23 · unverdicted · none · ref 232
