Extends functional scaling laws with data quality to derive optimal joint scheduling, proposing Drop-Stable-Rampup that improves accuracy by +1.70 over WSD and +2.98 over cosine decay on a 15B MoE model.
InInter- national Conference on Learning Representations
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Scale vectors in Pre-Norm LLMs aid optimization via preconditioning on linear layers rather than expressivity, and three lightweight modifications to them reduce terminal loss across model scales.
citing papers explorer
-
How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
Extends functional scaling laws with data quality to derive optimal joint scheduling, proposing Drop-Stable-Rampup that improves accuracy by +1.70 over WSD and +2.98 over cosine decay on a 15B MoE model.
-
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
Scale vectors in Pre-Norm LLMs aid optimization via preconditioning on linear layers rather than expressivity, and three lightweight modifications to them reduce terminal loss across model scales.