3 Pith papers cite this work. Polarity classification is still in progress.
Citing papers explorer (3 representative citing papers)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
  o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation
  Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models (see the first sketch after this list).
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens (see the second sketch after this list).
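The parallel/recurrent switching mentioned in the Structured Recurrent Mixers summary can be illustrated with a generic gated linear recurrence. The sketch below is not the SRM architecture itself (its details are not given here); it is a minimal NumPy demonstration, under that assumption, that one recurrence admits both a step-by-step form suited to inference and an algebraically equivalent cumulative form suited to parallel training.

```python
# Hedged sketch: the generic parallel/recurrent duality that linear-complexity
# sequence mixers exploit. This is NOT the Structured Recurrent Mixer itself,
# only an illustration of switching between the two equivalent forms.
import numpy as np

def recurrent_form(a, b):
    """h_t = a_t * h_{t-1} + b_t, evaluated one step at a time (inference-style)."""
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def parallel_form(a, b):
    """Same recurrence via cumulative products/sums (training-style, parallelizable)."""
    A = np.cumprod(a, axis=0)            # A_t = a_1 * ... * a_t
    # h_t = A_t * sum_{s<=t} b_s / A_s   (valid when all a_t are nonzero)
    return A * np.cumsum(b / A, axis=0)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, size=(8, 4))  # per-step decay gates
b = rng.normal(size=(8, 4))              # per-step inputs
assert np.allclose(recurrent_form(a, b), parallel_form(a, b))
```

The recurrent form costs O(1) state per generated token, while the cumulative form exposes all timesteps at once; switching between them is exactly the kind of representation change such models rely on.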
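For the Muon entry, here is a minimal sketch of a Muon-style update step with decoupled weight decay and RMS-matched update scaling. The Newton-Schulz coefficients and the 0.2 * sqrt(max(d_out, d_in)) scale follow the public Muon reference implementation and the Moonlight write-up as commonly reported; treat the exact constants, hyperparameters, and helper names (newton_schulz5, muon_step) as illustrative assumptions, not the paper's definitive recipe.

```python
# Hedged sketch of one Muon-style update for a 2D weight matrix:
# momentum -> approximate orthogonalization -> RMS-matched scaling,
# plus decoupled weight decay. Constants are assumptions from the
# public reference implementation, not verified against the paper.
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315     # assumed quintic coefficients
    X = G / (G.norm() + eps)              # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                        # work on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.1):
    """One update step for a 2D weight matrix."""
    momentum.mul_(beta).add_(grad)        # heavy-ball momentum buffer
    update = newton_schulz5(momentum)     # near-orthogonal update direction
    # Scale so the update RMS roughly matches AdamW (assumed Moonlight recipe).
    update *= 0.2 * max(weight.size(0), weight.size(1)) ** 0.5
    weight.mul_(1 - lr * weight_decay)    # decoupled weight decay
    weight.add_(update, alpha=-lr)
    return weight, momentum

# Usage example on a random matrix.
W = torch.randn(256, 512)
M = torch.zeros_like(W)
W, M = muon_step(W, torch.randn_like(W), M)
```

The orthogonalized direction keeps per-layer update spectra well conditioned, and the added weight decay plus scaling is what the summary credits for making the method stable at large scale.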