pith. machine review for the scientific record.

Title resolution pending

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

representative citing papers

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

The Muon optimizer, extended with weight decay and per-update scaling, achieves roughly 2x the training efficiency of AdamW for large LLMs, demonstrated with the Moonlight MoE model (3B activated / 16B total parameters) trained on 5.7T tokens.
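
For orientation, below is a minimal NumPy sketch of a Muon-style matrix update with the two additions named in this summary (decoupled weight decay and an RMS-matching update scale). The Newton-Schulz coefficients, the momentum form, and the 0.2 scaling constant are illustrative assumptions, not details taken from this page.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize g (push its singular values toward 1)
    with a quintic Newton-Schulz iteration. The coefficients below are the
    commonly used ones for Muon-style optimizers; treat them as illustrative."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)           # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:                               # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transposed else x

def muon_step(w, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.1):
    """One Muon-style update for a weight matrix w, with decoupled weight decay
    and an RMS-matching scale (the 0.2 * sqrt(max(fan_in, fan_out)) factor is an
    assumption based on the summary's description, not a verified constant)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    scale = 0.2 * np.sqrt(max(w.shape))          # keep update RMS comparable to AdamW
    w = w - lr * (scale * update + weight_decay * w)
    return w, momentum
```

A full training setup would apply this per weight matrix and handle non-matrix parameters (embeddings, norms, scalars) with an AdamW-style rule, as is typical for Muon configurations.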

citing papers explorer

Showing 5 of 5 citing papers.

  • SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 9

    Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining knowledge distillation with the language-modeling loss and progressive schedules yields a competitive 23A2B model from Qwen3-Next-80A3B.

  • The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling cs.AI · 2026-05-04 · unverdicted · none · ref 3

    APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding (a schematic sketch follows this list).

  • Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 100

    The Muon optimizer, extended with weight decay and per-update scaling, achieves roughly 2x the training efficiency of AdamW for large LLMs, demonstrated with the Moonlight MoE model (3B activated / 16B total parameters) trained on 5.7T tokens.

  • Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 109

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  • Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 121

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
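
As referenced above, here is a schematic sketch of the particle power sampling idea summarized for "The Model Knows, the Decoder Finds": particles are propagated with the base model as the proposal, reweighted toward the power target p(x)^alpha, and resampled at block boundaries. The `model` and `future_value` callables, the resampling rule, and all constants are illustrative assumptions rather than the paper's actual algorithm.

```python
import numpy as np

def particle_power_sample(model, prompt, alpha=2.0, n_particles=8,
                          block_len=16, n_blocks=8, future_value=None, seed=0):
    """Schematic particle sampler targeting p(x)^alpha with the base model as proposal.
    `model(seq)` is assumed to return a next-token probability vector;
    `future_value(seq)` is a hypothetical stand-in for future-value guidance."""
    rng = np.random.default_rng(seed)
    particles = [list(prompt) for _ in range(n_particles)]
    log_w = np.zeros(n_particles)
    for _ in range(n_blocks):
        # Propagate each particle for one block, sampling from the base model.
        for i, seq in enumerate(particles):
            for _ in range(block_len):
                probs = model(seq)
                tok = int(rng.choice(len(probs), p=probs))
                seq.append(tok)
                # Target p^alpha over proposal p contributes a factor p(tok)^(alpha - 1).
                log_w[i] += (alpha - 1.0) * np.log(probs[tok] + 1e-12)
        # Selection at the block boundary; optionally bias by a value estimate.
        scores = log_w.copy()
        if future_value is not None:
            scores += np.array([future_value(seq) for seq in particles])
        sel = np.exp(scores - scores.max())
        sel /= sel.sum()
        idx = rng.choice(n_particles, size=n_particles, p=sel)
        particles = [list(particles[j]) for j in idx]
        log_w = np.zeros(n_particles)            # weights reset after resampling
    # After the last resampling, particles approximate draws from the power target.
    return particles[0]
```

Resetting weights after resampling is standard sequential-importance-resampling bookkeeping; the paper's future-value guidance is represented here only by the optional score offset at block boundaries.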