pith. machine review for the scientific record.

arxiv: 2601.21698 · v2 · submitted 2026-01-29 · 💻 cs.LG · cs.AI

Recognition: unknown

Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics

Hadi Amiri, Mohamed Elgaar

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI
keywords: learning, curriculum, ordering, phases, training, changes, curricula, dynamics
Original abstract

Curriculum learning changes the order of pretraining data, but it remains unclear how ordering changes the learning dynamics. We pretrain models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula--Age-of-Acquisition, word frequency, and Verb Variation (VV)--and compare each against Random ordering. We analyze latent training phases, gradient noise scale (GNS), and the singular-value structure of the output head. We find that training follows a shared sequence of latent phases, while curricula mainly change time spent in each phase. Random ordering yields higher GNS at 14M-70M and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation. A reverse-order VV control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these stability differences are smaller. These results indicate that the curricula studied here are associated with more stable within-phase training in smaller models rather than with the creation of new phases.
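One diagnostic the abstract relies on is the entropy of the output head's singular-value spectrum (a spike in this quantity is read as output-head saturation). A minimal sketch of that computation is below; the function name and the random stand-in weight matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def singular_entropy(W: np.ndarray) -> float:
    """Shannon entropy of the normalized singular-value spectrum of W.

    Low entropy means a few directions dominate the output head;
    the maximum is log(min(W.shape)) for a flat spectrum.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    p = s / s.sum()                         # normalize to a distribution
    p = p[p > 0]                            # guard against log(0)
    return float(-(p * np.log(p)).sum())

# Stand-in for an output-head weight matrix (vocab x hidden), not real model weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 128))
print(f"singular entropy: {singular_entropy(W):.3f}")
```

Tracking this scalar over training checkpoints would reproduce the kind of curve the abstract's "late singular-entropy spikes" refer to, under these assumptions.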

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.