D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
Data mixing laws: Optimizing data mixtures by predicting language modeling performance
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
Transfer-aware data allocation derived from observed power-law scaling laws for asymmetric knowledge transfer in 3D medical imaging outperforms standard proportional sampling by up to 58% and generalizes to new budgets.
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
citing papers explorer
-
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.