Advances in Neural Information Processing Systems , volume=

On the representation collapse of sparse mixture of experts , author=

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

citing papers explorer

Showing 2 of 2 citing papers.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 41
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs cs.CL · 2026-05-01 · unverdicted · none · ref 64
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

Advances in Neural Information Processing Systems , volume=

fields

years

verdicts

representative citing papers

citing papers explorer