3 papers in the Pith index cite this work; polarity classification is still in progress.
Citing papers
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with a language modeling loss plus progressive schedules yields a competitive 23A2B model from Qwen3-Next-80A3B. (A minimal sketch of such a combined objective follows this list.)
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
  Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks. (A sketch of such a per-head output gate also follows this list.)
- Qwen3 Technical Report
  One-line summary unavailable.
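
The SlimQwen entry combines knowledge distillation with the standard language modeling loss under a progressive schedule. Below is a minimal PyTorch sketch of one way such a combined objective can look; the function name, the temperature, and the linear ramp on the KD weight are illustrative assumptions, not the schedule reported in the paper.

```python
import torch.nn.functional as F


def combined_kd_lm_loss(student_logits, teacher_logits, labels,
                        step, total_steps, temperature=2.0):
    """Blend a KL distillation term with the usual next-token loss,
    shifting weight from KD to LM over training (illustrative schedule)."""
    vocab = student_logits.size(-1)

    # Standard language modeling cross-entropy on the student.
    lm_loss = F.cross_entropy(
        student_logits.reshape(-1, vocab),
        labels.reshape(-1),
        ignore_index=-100,
    )

    # Temperature-scaled KL divergence between teacher and student
    # token distributions.
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits.reshape(-1, vocab) / t, dim=-1),
        F.softmax(teacher_logits.reshape(-1, vocab) / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Assumed linear ramp: lean on the teacher early, then hand over to
    # the plain LM objective as continued pretraining proceeds.
    kd_weight = max(0.0, 1.0 - step / total_steps)
    return kd_weight * kd_loss + (1.0 - kd_weight) * lm_loss
```

Ramping the KD weight toward zero lets the pruned student rely on the teacher early and converge to ordinary continued pretraining later, which is one plausible reading of "progressive schedules".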
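The Gated Attention entry describes a head-specific sigmoid gate applied to the output of scaled dot-product attention (SDPA). A minimal PyTorch sketch of that idea follows; the module layout, conditioning the gate on the pre-attention hidden states, and the names gate_proj and out_proj are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAttention(nn.Module):
    """Multi-head attention with a query-dependent, head-specific sigmoid
    gate on the SDPA output (sketch only; details are assumptions)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        # One gate logit per head and position, computed from the input states.
        self.gate_proj = nn.Linear(d_model, n_heads, bias=True)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head) for SDPA.
        q = q.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Head-specific sigmoid gate conditioned on the hidden states:
        # shape (batch, heads, seq, 1), broadcast over d_head.
        gate = torch.sigmoid(self.gate_proj(x))      # (b, s, heads)
        gate = gate.transpose(1, 2).unsqueeze(-1)    # (b, heads, s, 1)
        attn = attn * gate

        attn = attn.transpose(1, 2).reshape(b, s, d)
        return self.out_proj(attn)
```

A gate value near zero lets a head effectively opt out for a given query, which is the query-dependent sparse modulation the summary refers to, and reduces the pressure to park attention mass on a sink token.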