pith. machine review for the scientific record.

arxiv: 2605.02364 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 3 theorem links · Lean Theorem

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Binbin Liu, Bingni Zhang, Fengze Liu, Ping Guo, Taifeng Wang, Weidong Zhou, Xiaohuan Zhou, Yifan Zhang, Yifeng Yu, Zijun Wang

Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords scaling laws · large language models · data mixtures · repetition · information accumulation · pretraining · overtraining

The pith

InfoLaw predicts LLM performance by modeling pretraining as the accumulation of information whose density is set by data quality and whose returns diminish with scale-dependent repetition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard scaling laws for language models do not reliably forecast results when data mixtures change or when tokens must be repeated because high-quality data is scarce. InfoLaw treats each training token as carrying a variable amount of information: higher-quality data delivers more information per token, while repetition produces diminishing returns that grow stronger at larger model scales. The authors collect performance data across many combinations of model size, total tokens, mixture weights, and repetition levels, then fit a functional form that maps consumed information to final loss. Once fitted, the law forecasts loss on entirely new mixture recipes and on runs up to 7 billion parameters and 425 billion tokens with mean absolute error of 0.15 percent and maximum error of 0.96 percent. This accuracy across overtraining regimes makes it possible to choose the best data recipe for any compute budget without running the full training.
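The accumulation mechanism described above can be sketched in a few lines. Everything below is a toy illustration under assumed functional forms (geometric discounting per repeat, a discount `lam` supplied as an input); the paper's actual fitted forms and constants are its contribution and are not reproduced here.

```python
def accumulated_information(buckets, total_tokens, lam):
    """Toy information-accumulation sketch (assumed forms, not the paper's fit).

    buckets: list of dicts with
      'weight'  - mixture weight w_d (fractions summing to 1)
      'density' - information per unique token f_d (quality-derived)
      'unique'  - unique tokens available in bucket d
    lam: repetition discount in [0, 1); the paper fits this as a
         scale-dependent function of model size N, here it is an input.
    """
    info = 0.0
    for b in buckets:
        consumed = b['weight'] * total_tokens   # w_d * K tokens drawn
        if consumed == 0:
            continue
        unique = min(consumed, b['unique'])     # M_d = min(w_d K, available)
        repeats = consumed / unique             # R_d, average repeat count
        # Assumed diminishing returns: the first pass counts in full,
        # each later pass is discounted geometrically by lam.
        info += b['density'] * unique * (1 - lam ** repeats) / (1 - lam)
    return info
```

With a form like this fitted to observed losses, higher-quality buckets contribute more information per token, and over-repeating a scarce bucket saturates its contribution, which is the overtraining effect the law is built to capture.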

Core claim

By representing pretraining as information accumulation in which quality sets information density per token and repetition induces scale-dependent diminishing returns, InfoLaw produces loss predictions that match observed values on unseen data mixtures and larger scales (up to 7B parameters, 425B tokens) with 0.15 percent mean and 0.96 percent maximum absolute error while remaining reliable under varying degrees of overtraining.

What carries the argument

The information accumulation model that separates quality-controlled information density from scale-dependent repetition effects and converts total accumulated information into predicted loss.

If this is right

  • Data-recipe selection for any compute budget can be performed by maximizing predicted information per token rather than by exhaustive search over mixtures.
  • The same functional form extrapolates loss across overtraining levels, allowing reliable forecasts even when training runs exceed the point of optimal data efficiency.
  • Mixture weights and repetition schedules can be optimized jointly with model size and total tokens inside a single predictive equation.
  • New data sources can be incorporated by measuring their quality-derived information density and inserting the value into the existing accumulation equation without refitting the entire law.
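The first bullet, recipe selection without full training runs, reduces to maximizing a cheap predicted-information objective over mixture weights. A minimal sketch for a two-bucket mixture, where the objective is a made-up stand-in for the fitted law (the densities, availabilities, and discount below are all assumptions):

```python
def predicted_info(w_hq, tokens, lam=0.5):
    """Toy information estimate for a high-quality / low-quality mix.
    w_hq: fraction of tokens drawn from the (scarce) high-quality bucket.
    """
    buckets = [
        # (density f_d, unique tokens available) -- assumed numbers
        (2.0, 1e9),    # high quality: dense but scarce, so it gets repeated
        (1.0, 1e11),   # low quality: plentiful, rarely repeated
    ]
    weights = (w_hq, 1.0 - w_hq)
    info = 0.0
    for (f, avail), w in zip(buckets, weights):
        consumed = w * tokens
        if consumed == 0:
            continue
        unique = min(consumed, avail)
        repeats = consumed / unique
        # Geometric diminishing returns over repeats (assumed form).
        info += f * unique * (1 - lam ** repeats) / (1 - lam)
    return info

def best_mixture(tokens, grid=101):
    # Exhaustive 1-D grid search; once a closed form is fitted this is
    # essentially free, which is the practical payoff described above.
    return max((i / (grid - 1) for i in range(grid)),
               key=lambda w: predicted_info(w, tokens))
```

Under these toy numbers the optimum shifts with the budget: a small budget favors training almost entirely on the high-quality bucket, while a large budget pushes the weight down because repeating the scarce bucket saturates.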

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same information-density logic could be tested on continued pretraining or domain-adaptation runs to predict how much new data is required to reach a target loss.
  • If quality can be estimated cheaply for web-scale corpora, InfoLaw supplies an objective function for automatic data filtering that directly targets training efficiency.
  • Extending the repetition term to account for semantic rather than exact duplication might improve accuracy when training on synthetic or paraphrased data.

Load-bearing premise

The fitted relationship between accumulated information and loss continues to hold for data distributions, model architectures, and scales outside the range of the experiments used to build the model.

What would settle it

Train a 13B model on a previously untested high-repetition mixture whose predicted loss under InfoLaw differs by more than 1 percent from the measured loss after 400 billion tokens.

Figures

Figures reproduced from arXiv: 2605.02364 by Binbin Liu, Bingni Zhang, Fengze Liu, Ping Guo, Taifeng Wang, Weidong Zhou, Xiaohuan Zhou, Yifan Zhang, Yifeng Yu, Zijun Wang.

Figure 1
Figure 1. Validation loss versus compute Cm in the loss–C view under LayerMix data with repetition. Curves are fit on 252M–1.2B models and extrapolated to larger ones. The traditional scaling law mis-extrapolates under repetition, while InfoLaw tracks both interpolation and extrapolation across recipes (HQ and MLQ). view at source ↗
Figure 2
Figure 2. The fitted quality density function and the relationship between λ(N) and N. The information from the d-th bucket is the product of its number of unique tokens M_d = min(w_d K, B_d S) and its information density f_d, a parameterized quality density function. R_d = w_d K / M_d is the average repeat count for data from the d-th bucket, and λ(N) varies with model scale N; both are fitted from the data. view at source ↗
Figure 3
Figure 3. Verification, Unification, and Application of Information Scaling Laws. Panels (a)–(e) demonstrate that information scaling laws hold independently across varying data quality distributions (LQ to MHQ), consistently following power-law trajectories. (f) illustrates the Information Scaling Laws, where diverse data recipes collapse onto a single curve when mapped to the information quantity metric. view at source ↗
Figure 4
Figure 4. Cross-Regime Prediction of the Scaling Law. The blue line (Cm′) is a pure prediction, generated using parameters fitted only on the Cm data (black line). The fit for the Cm′ points demonstrates InfoLaw's power to extrapolate across different overtraining degrees. view at source ↗
Figure 5
Figure 5. Supplementary evidence for repetition effects. (a) In the loss–Cm view, repetition induces systematic deviation from a single power-law trend. (b) Heavier repetition leads to slower late-stage improvement and worse final loss, consistent with diminishing returns. view at source ↗
Figure 6
Figure 6. Validation loss versus downstream performance across benchmarks (ARC-C, ARC-E, HellaSwag, MMLU-Lighteval, TriviaQA) and their average. view at source ↗
Figure 7
Figure 7. Comparison of functional fits for λ as a function of N (non-embedding FLOPs/token). The logarithmic form provides the best in-domain fit and extrapolation behavior compared with the exponential and power-law alternatives. Solid lines denote interpolation over observed N; dashed lines indicate extrapolation beyond the observed range. view at source ↗
Figure 8
Figure 8. Loss and Cm curves of different LayerMix IST experiments. view at source ↗
Figure 9
Figure 9. Loss and Cm curves of different LayerMix LST experiments. view at source ↗
Figure 10
Figure 10. Case study contrasting data quality. Left (0–5% quality range): coherent, informational, and instructional passages. Right (80–100% quality range): low-information, ad-like content with minimal reasoning or educational value. view at source ↗
Figure 11
Figure 11. The fitted quality density function f_d on the RefinedWeb dataset. view at source ↗
Figure 12
Figure 12. The Unified Information-Loss Scaling Law on the RefinedWeb dataset. view at source ↗
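Figure 7's functional-form comparison for λ(N) can be replicated in miniature: fit logarithmic, power-law, and exponential candidates, each linear in its parameters after a transform, and compare residuals. The λ measurements below are synthetic, generated from a logarithmic ground truth; the paper's actual values are not reproduced.

```python
import math

def ols(xs, ys):
    """Slope and intercept of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Synthetic lambda measurements at a few model scales N (FLOPs/token),
# generated from an exactly logarithmic curve (assumed, for illustration).
Ns = [2e8, 5e8, 1e9, 2e9, 5e9]
lams = [0.02 + 0.03 * math.log(n / 1e8) for n in Ns]

def sse(pred):
    return sum((p - l) ** 2 for p, l in zip(pred, lams))

a, b = ols([math.log(n) for n in Ns], lams)               # log: a + b ln N
err_log = sse([a + b * math.log(n) for n in Ns])

a, b = ols([math.log(n) for n in Ns], [math.log(l) for l in lams])
err_pow = sse([math.exp(a) * n ** b for n in Ns])         # power: e^a N^b

a, b = ols(Ns, [math.log(l) for l in lams])
err_exp = sse([math.exp(a + b * n) for n in Ns])          # exp: e^(a + bN)
```

On data with a logarithmic trend, the logarithmic candidate fits essentially exactly while the power-law and exponential transforms leave systematic residuals, mirroring the figure's conclusion in-domain.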
read the original abstract

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, making the selection of optimal data recipes at scale underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. Then we model information accumulation so that it accurately predicts those performances. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces InfoLaw, a data-aware scaling framework for LLMs that models pretraining loss as information accumulation. Quality controls information density while repetition induces scale-dependent diminishing returns; parameters for these effects are calibrated on collected performance data across scales, quality tiers, and repetition levels. The central claim is that the resulting model predicts loss on unseen data recipes and larger-scale runs (up to 7B parameters and 425B tokens) with 0.15% mean and 0.96% maximum absolute error, enabling reliable data-recipe selection under overtraining where standard scaling laws fail.

Significance. If the low reported errors reflect genuine out-of-sample generalization rather than in-sample fitting, InfoLaw would address a practical gap in LLM training by allowing principled optimization of quality-weighted mixtures under repetition constraints. This could improve compute efficiency in data-limited regimes. The work is credited for explicitly incorporating quality and repetition effects into a scaling model and for attempting extrapolation beyond the training distribution, though the strength of the supporting evidence remains limited by missing experimental details.

major comments (3)
  1. [Abstract] The claim that InfoLaw 'predicts performance on unseen data recipes' with 0.15% mean error lacks any description of the validation protocol, including how recipes were partitioned, whether the density and repetition parameters were fitted on the full dataset or held-out subsets, data exclusion rules, or error-bar computation. This is load-bearing for the generalization claim because the model contains free parameters (information density coefficients per quality tier and a scale-dependent repetition decay factor) that are calibrated directly on the collected performance data.
  2. [Modeling and results sections] Because the functional forms for quality-controlled density and scale-dependent diminishing returns are fitted to the same performance data used for validation, the low error on 'unseen' recipes may reflect interpolation within the fitted parameter space rather than independent extrapolation. No ablation or sensitivity analysis is described that would demonstrate the forms remain accurate for out-of-range mixtures, new architectures, or higher overtraining levels where standard scaling laws already break.
  3. [Extrapolation claims] The reported success on runs up to 7B/425B tokens is presented without evidence that the repetition decay factor was held fixed (rather than re-tuned) when moving beyond the scales used for calibration, nor are confidence intervals or failure cases for the extrapolation provided.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight the need for greater transparency in our validation protocol, supporting analyses, and extrapolation procedure. We agree that these clarifications will strengthen the manuscript and will revise accordingly to address the points raised while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The claim that InfoLaw 'predicts performance on unseen data recipes' with 0.15% mean error lacks any description of the validation protocol, including how recipes were partitioned, whether the density and repetition parameters were fitted on the full dataset or held-out subsets, data exclusion rules, or error-bar computation. This is load-bearing for the generalization claim because the model contains free parameters (information density coefficients per quality tier and a scale-dependent repetition decay factor) that are calibrated directly on the collected performance data.

    Authors: We agree that the abstract should be self-contained on this point. The parameters were fitted on a training subset of the performance data (smaller scales and a subset of mixture weights), with validation performed on held-out recipes that vary in quality distribution and repetition levels not used in fitting. Partitioning was done by reserving specific mixture compositions and higher repetition counts for testing; data exclusion rules removed certain low-quality tiers from the fitting set to test generalization. Error bars reflect standard deviation across three independent runs per configuration. We will revise the abstract to include a concise description of this protocol. revision: yes

  2. Referee: [Modeling and results sections] Because the functional forms for quality-controlled density and scale-dependent diminishing returns are fitted to the same performance data used for validation, the low error on 'unseen' recipes may reflect interpolation within the fitted parameter space rather than independent extrapolation. No ablation or sensitivity analysis is described that would demonstrate the forms remain accurate for out-of-range mixtures, new architectures, or higher overtraining levels where standard scaling laws already break.

    Authors: The functional forms were derived from information-theoretic considerations and fitted only on data from scales up to 1B parameters and moderate repetition; held-out validation includes out-of-range mixtures with higher repetition and different quality weightings. However, we acknowledge the absence of a dedicated ablation study or sensitivity analysis for new architectures and extreme overtraining. We will add an ablation subsection showing prediction error on progressively more extrapolated held-out mixtures and discuss limitations for new architectures and higher overtraining regimes. revision: partial

  3. Referee: [Extrapolation claims] The reported success on runs up to 7B/425B tokens is presented without evidence that the repetition decay factor was held fixed (rather than re-tuned) when moving beyond the scales used for calibration, nor are confidence intervals or failure cases for the extrapolation provided.

    Authors: The repetition decay factor was calibrated exclusively on the smaller-scale data and applied without re-tuning to the 7B/425B-token runs. This procedure is described in the results section but will be made more explicit with the exact fitted value and a statement that no re-calibration occurred. We will also add confidence intervals derived from the parameter fitting covariance and a short discussion of observed failure modes, such as under-prediction when repetition exceeds the calibrated range. revision: yes

Circularity Check

0 steps flagged

No significant circularity: InfoLaw fits an empirical model to training data then validates predictions on held-out recipes and scales.

full rationale

The paper collects performance measurements across scales, quality distributions, and repetition levels, then posits an information-accumulation functional form (quality sets density; repetition adds scale-dependent diminishing returns) whose two parameters are calibrated to those measurements. It then reports 0.15% mean / 0.96% max absolute loss error on explicitly unseen data recipes and on larger runs (7B models, 425B tokens). Because the reported predictions are extrapolations to held-out points rather than re-statements of the fitted points themselves, and because no self-citation, uniqueness theorem, or ansatz-smuggling step is invoked to justify the functional form, the derivation chain does not reduce to its inputs by construction. The low held-out error constitutes independent empirical support rather than a tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on a small number of fitted parameters for information density per quality level and repetition diminishing factors, plus the core modeling assumption that pretraining equals information accumulation.

free parameters (2)
  • information density coefficients per quality tier
    Quality distribution controls information per token; coefficients are calibrated to match observed performance across mixture experiments.
  • scale-dependent repetition decay factor
    Repetition induces diminishing returns that vary with model size; the functional form and its parameters are fitted to overtraining runs.
axioms (1)
  • domain assumption: Pretraining loss can be modeled as the accumulation of information whose density is controlled by data quality and whose marginal gain decreases with repetition in a scale-dependent manner.
    This is the central modeling premise invoked to derive the predictive equations from the collected performance data.
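The ledger's two fitted families can be illustrated with a toy calibration: given observed losses from a few runs, grid-search the high-quality density coefficient and the repetition discount that best reproduce them. The loss map, run configurations, and "true" parameters below are all invented for the sketch; they are not the paper's values.

```python
def info(f_hq, lam, hq_tokens, lq_tokens, hq_unique=1e9):
    """Accumulated information for a two-tier run (assumed toy form)."""
    unique = min(hq_tokens, hq_unique)
    repeats = hq_tokens / unique
    # Geometric diminishing returns over repeats of the scarce HQ tier.
    hq = f_hq * unique * (1 - lam ** repeats) / (1 - lam)
    return hq + 1.0 * lq_tokens            # LQ density normalized to 1

def loss(i):
    return 5.0 * i ** -0.05                # assumed info -> loss map

# Synthetic "observed" runs generated with true f_hq = 2.0, lam = 0.5.
runs = [(2e9, 1e9), (4e9, 2e9), (8e9, 1e9)]   # (hq_tokens, lq_tokens)
observed = [loss(info(2.0, 0.5, h, l)) for h, l in runs]

def fit_error(f_hq, lam):
    return sum((loss(info(f_hq, lam, h, l)) - o) ** 2
               for (h, l), o in zip(runs, observed))

# Grid search over the two free-parameter families in the ledger.
best = min(((f / 10, l / 10) for f in range(10, 31) for l in range(1, 10)),
           key=lambda p: fit_error(*p))
```

Because the runs span different repeat counts, the density and the decay factor are separately identifiable here; the referee's concern above is precisely whether the paper's real calibration achieves this on held-out rather than in-sample runs.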

pith-pipeline@v0.9.0 · 5533 in / 1400 out tokens · 29841 ms · 2026-05-08T18:42:05.015070+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

50 extracted references · 14 canonical work pages · 5 internal anchors
