Optimal learning rate for models from 22M to 707M parameters shows nonlinear upward curvature with scale that disappears under effective learning rate and data-scale extrapolation.
Scaling laws across model architectures: A comparative analysis of dense and M o E models in large language models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
citing papers explorer
-
On the Nonlinearity of Learning Rate Scaling for LLM Training
Optimal learning rate for models from 22M to 707M parameters shows nonlinear upward curvature with scale that disappears under effective learning rate and data-scale extrapolation.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.