arXiv preprint arXiv:2507.17702
10 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (10)
Citation roles: background (2)
Citation polarities: background (2)
citing papers explorer
- RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
  RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
  Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
- Optimus: Elastic Decoding for Efficient Diffusion LLM Serving
  Optimus enables elastic decoding granularity adaptation in diffusion LLMs via chunked decoding and load-based scheduling to raise throughput under dynamic conditions.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- Geometric Routing Enables Causal Expert Control in Mixture of Experts
  Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
- Generalization and Scaling Laws for Mixture-of-Experts Transformers
  A covering-number bound and ERM analysis for MoE Transformers under manifold data produces generalization and scaling laws that treat active capacity like dense networks while adding routing overhead.
- DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
  DAG-MoE uses a lightweight module to learn DAG-based structural aggregation of selected experts, expanding combination space and enabling intra-layer multi-step reasoning compared to standard weighted-sum MoE.
- ZONOS2 Technical Report
  ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.