pith. machine review for the scientific record.

arxiv: 2605.09189 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords scaling laws · large language models · data-constrained training · overfitting · compute allocation · multi-epoch pretraining · performance extrapolation

The pith

A new scaling law form accounts for data limits and repeated training to optimize compute allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The dominant scaling laws were calibrated only for data-rich, single-epoch training and fail when unique data is scarce or when models train on the same data multiple times. This work introduces a closed-form extension that decomposes performance loss into separate contributions from insufficient model capacity, insufficient total training, and overfitting from data repetition. The expression stays bounded between the lowest achievable loss and a baseline for uninformed predictions, and it matches earlier laws exactly when data is abundant. If the form holds, it allows direct calculation of the best way to spend a given compute budget on model size versus data volume versus number of epochs. Validation across several model families and language model datasets shows improved accuracy when predicting results at larger scales than those used to fit the parameters.

Core claim

The central proposal is the loss function L(N, D, T) = E + (L_0 - E) h / (1 + h) where h = a/N^α + b/T^β + c N^γ / D^δ. This expression isolates the effects of model size N, total tokens seen T, and unique data D while ensuring the loss saturates at physically meaningful bounds. It recovers the Chinchilla form as a special case and, once fitted to data, supports cost-sensitive optimization of the training configuration.
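
The proposed form can be sketched numerically. Every parameter value below is an illustrative assumption, not the paper's fitted calibration; the sketch only demonstrates the claimed bounds and the data-rich reduction.

```python
# Minimal sketch of L(N, D, T) = E + (L0 - E) h / (1 + h); all constants
# here are made up for illustration, not the paper's fitted values.
E, L0 = 1.69, 10.0                      # irreducible loss, uninformed baseline
a, b, c = 400.0, 2000.0, 0.05
alpha, beta, gamma, delta = 0.34, 0.28, 0.2, 0.5

def h(N, D, T):
    # undercapacity + undertraining + overfitting
    return a / N**alpha + b / T**beta + c * N**gamma / D**delta

def loss(N, D, T):
    hh = h(N, D, T)
    return E + (L0 - E) * hh / (1.0 + hh)

# Bounded: h -> 0 drives loss toward E; h -> infinity drives it toward L0.
rich = loss(N=1e12, D=1e14, T=1e14)     # large model, abundant single-epoch data
starved = loss(N=10, D=10, T=10)        # degenerate run
assert E < rich < starved < L0

# Data-rich, single-epoch limit (T = D, overfitting term negligible, h small):
# L ≈ E + (L0 - E)(a/N^alpha + b/D^beta), i.e. the Chinchilla form with
# A = (L0 - E) * a and B = (L0 - E) * b.
```

The saturating wrapper h/(1+h) is what keeps the expression between E and L0 no matter how small N, D, or T become.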

What carries the argument

The variable h that combines power-law terms in model size, total training tokens, and unique data inside a saturating loss expression.

If this is right

  • When data costs nothing the optimal allocation matches the Chinchilla recommendation for model size and data volume.
  • Higher data costs shift the optimum toward smaller unique datasets trained over additional epochs.
  • The formula yields lower extrapolation error than prior forms on published LLM scaling grids and new multi-epoch experiments.
  • The same functional form works for MLPs, ResNets, Fourier operators, and transformers across vision, science, and language tasks.
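
The allocation behavior in the first two bullets can be sketched with a brute-force search: fix a compute budget, price unique data, and pick the (N, D, epochs) triple minimizing the form. The parameter values, the data price rho_D, and the C ≈ 6·N·T compute proxy are illustrative assumptions; the paper instead solves stationarity conditions for the optimum.

```python
# Hedged sketch: spend a fixed compute budget C ~ 6*N*T on model size N,
# unique tokens D, and epochs T/D under the proposed loss form.
# All constants are illustrative, not the paper's calibration.
E, L0 = 1.69, 10.0
a, b, c = 400.0, 2000.0, 0.05
alpha, beta, gamma, delta = 0.34, 0.28, 0.2, 0.5

def loss(N, D, T):
    h = a / N**alpha + b / T**beta + c * N**gamma / D**delta
    return E + (L0 - E) * h / (1 + h)

def best_allocation(C, rho_D=0.0, data_budget=1.0):
    # Sweep model size and epoch count; tokens seen T is fixed by C and N.
    best = None
    for logN in range(6, 13):
        N = 10.0**logN
        T = C / (6.0 * N)
        for epochs in (1, 2, 4, 8, 16, 32):
            D = T / epochs
            if rho_D * D > data_budget:          # unique data too expensive
                continue
            cand = (loss(N, D, T), N, D, epochs)
            if best is None or cand[0] < best[0]:
                best = cand
    return best

free = best_allocation(1e21, rho_D=0.0)          # data free: one epoch wins
pricey = best_allocation(1e21, rho_D=1e-10)      # data priced: epochs go up
assert free[3] == 1 and pricey[3] > free[3]
```

With free data the overfitting term strictly penalizes repetition, so a single epoch on all unique data is optimal; once unique tokens carry a price, the search shifts to a smaller corpus trained for more epochs, matching the second bullet.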

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This allocation rule could be used to set data acquisition targets for future large training runs where web-scale unique data is finite.
  • The separation of terms offers a way to attribute performance gains to changes in model size, training duration, or data diversity.
  • Testing the form on reinforcement learning or multimodal models would check whether the same decomposition applies beyond supervised pretraining.

Load-bearing premise

The chosen decomposition of loss into undercapacity, undertraining, and overfitting terms with power-law dependencies continues to describe behavior accurately at scales and data regimes beyond those tested.

What would settle it

A direct test would be to measure validation loss for a set of models trained at compute levels several times larger than the fitting set, with controlled variation in unique data size and epoch count, and verify agreement with the formula's predictions.
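
The scoring half of that test can be sketched as follows. The fit itself is elided; the parameter values and held-out cells are hypothetical stand-ins, not numbers from the paper.

```python
import math

# Hypothetical fitted parameters -- illustrative only.
params = dict(E=1.69, L0=10.0, a=400.0, b=2000.0, c=0.05,
              alpha=0.34, beta=0.28, gamma=0.2, delta=0.5)

def predict(N, D, T, E, L0, a, b, c, alpha, beta, gamma, delta):
    h = a / N**alpha + b / T**beta + c * N**gamma / D**delta
    return E + (L0 - E) * h / (1 + h)

def rmse(cells):
    # cells: (model size N, unique tokens D, total tokens T, observed loss)
    errs = [(predict(N, D, T, **params) - obs) ** 2 for N, D, T, obs in cells]
    return math.sqrt(sum(errs) / len(errs))

# Hypothetical held-out cells at larger compute than any fitting cell,
# with controlled variation in unique data and epoch count.
held_out = [
    (1e9, 2e10, 2e10, 3.10),   # single epoch on all unique data
    (1e9, 5e9,  2e10, 3.15),   # 4 epochs on a quarter of the data
]
score = rmse(held_out)

# Sanity property: repeating data (smaller D at fixed N, T) must raise
# the predicted loss via the overfitting term.
assert predict(1e9, 5e9, 2e10, **params) > predict(1e9, 2e10, 2e10, **params)
assert score > 0
```

Agreement would mean this RMSE stays small across cells several times beyond the fitting range, for each controlled (D, epoch) variation.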

Figures

Figures reproduced from arXiv: 2605.09189 by Christopher M. Bryant, Hao Liu.

Figure 1. Qualitative shape of our form (right three columns) compared to Chinchilla (left).
Figure 2. Empirical MNIST running-min validation-loss surface across a selection of …
Figure 3. Observed vs. predicted loss for our form (leftmost column) and the three closest competitors, …
Figure 4. Observed vs. predicted loss on the Far-Farseer and TinyStories datasets across four forms (Chinchilla, Farseer, our vanilla form, and our corrected form), on benchmarks where vanilla E-collapse is measurable. Top row (“Far-Farseer”): Farseer’s published compute-extrapolation benchmark (held-out targets in orange, in-cohort cells in blue). Bottom row: TinyStories high-C holdout (orange = held-out cells). …
Figure 5. MNIST empirical running-min validation-loss surface across all …
Figure 6. CIFAR-100 empirical running-min validation-loss surface.
Figure 7. Darcy empirical running-min validation-loss surface.
Figure 8. TinyStories empirical running-min validation-loss surface.
Figure 9. High-C: observed vs. predicted loss on the four training experiments (MNIST, CIFAR-100, Darcy, TinyStories) under the high-C holdout.
Figure 10. High-D: observed vs. predicted loss on the four training experiments under the high-D holdout.
Figure 11. High-C: observed vs. predicted loss on the five published LLM grids (Chinchilla, Muennighoff, Gadre, Porian, Farseer) under the high-C holdout.
Figure 12. High-D: observed vs. predicted loss on the five published LLM grids under the high-D holdout.
read the original abstract

The scaling laws guiding modern model training were calibrated for a single regime: data-rich, single-epoch pretraining. The dominant such scaling law form, Chinchilla's $L = E + A/N^\alpha + B/D^\beta$, has three structural limitations outside that regime: it diverges as unique data shrinks instead of saturating at the uninformed baseline; it cannot represent overfitting when capacity exceeds the data; and it conflates total examples seen with unique examples available. We propose a closed-form extension, $L(N, D, T) = E + (L_0 - E)\,h/(1+h)$ with $h = a/N^\alpha + b/T^\beta + c\,N^\gamma/D^\delta$, that decomposes loss into undercapacity, undertraining, and overfitting terms. It saturates between the irreducible loss $E$ and an uninformed baseline $L_0$ fixed by the loss type, and reduces to Chinchilla in the data-rich, single-epoch limit. We validate it on four multi-epoch experiments spanning four architecture families (MLPs, ResNets, Fourier neural operators, and transformers) across vision, scientific ML, and language domains, and refit it to five published LLM scaling-law grids. Extrapolating to higher compute and larger unique data than seen at fit time, our form achieves state-of-the-art RMSE on every published LLM grid we evaluate and on most cells of our constructed experiments. Once calibrated, the form admits a cost-aware allocation that recovers Chinchilla's optimum when data is free and shifts toward smaller corpora and more epochs as data grows expensive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a closed-form extension to Chinchilla scaling laws, L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ, to handle data-constrained regimes by decomposing loss into undercapacity, undertraining, and overfitting terms. It saturates at an uninformed baseline L0, reduces to Chinchilla when T=D and the overfitting term vanishes, and is validated on multi-epoch experiments across four architecture families plus refits to five published LLM grids, claiming superior RMSE on extrapolations to higher compute and unique data.

Significance. If the functional form and its decomposition generalize, the work would be significant for practical scaling in data-limited settings by enabling cost-aware allocation of compute that shifts toward more epochs on smaller corpora as data becomes expensive. The explicit reduction to Chinchilla, saturation property, and validation across vision, scientific ML, and language domains are clear strengths.

major comments (2)
  1. [Abstract] Abstract, proposed equation: The headline claim of state-of-the-art RMSE on extrapolation to higher compute and larger unique data than seen at fit time is evaluated after fitting parameters to the same grids; this creates a circularity risk for the extrapolation advantage, as the specific additive structure inside h and the nine free parameters (E, L0, a, b, c, α, β, γ, δ) are chosen empirically rather than derived.
  2. [Validation experiments] Validation on published LLM grids: While the form recovers Chinchilla in the appropriate limit and improves RMSE on the tested grids, the central extrapolation claim rests on post-fit performance within or near the fitting range; independent verification on new architectures, data distributions, or substantially larger regimes (beyond the four families and five grids) is needed to support generalization of the decomposition.
minor comments (1)
  1. The nine free parameters and their roles in the functional form would benefit from a dedicated table or explicit listing in the methods section for reader clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with point-by-point responses. Revisions have been made to clarify the extrapolation procedure and to discuss validation limitations.

read point-by-point responses
  1. Referee: [Abstract] Abstract, proposed equation: The headline claim of state-of-the-art RMSE on extrapolation to higher compute and larger unique data than seen at fit time is evaluated after fitting parameters to the same grids; this creates a circularity risk for the extrapolation advantage, as the specific additive structure inside h and the nine free parameters (E, L0, a, b, c, α, β, γ, δ) are chosen empirically rather than derived.

    Authors: We appreciate the concern about potential circularity. The extrapolation procedure fits parameters only on subsets corresponding to lower compute budgets and smaller unique data sizes, then evaluates RMSE on held-out points with higher compute and larger unique data. This split is described in the validation sections. The functional form is empirical but deliberately constructed to enforce saturation between E and L0 and exact reduction to the Chinchilla law when T = D and the overfitting term vanishes; these constraints provide structure beyond pure curve-fitting. We have revised the abstract and added an explicit paragraph in the methods section detailing the train/test splits and fitting protocol to eliminate ambiguity. The nine parameters are fitted per dataset, consistent with other scaling-law literature. revision: partial

  2. Referee: [Validation experiments] Validation on published LLM grids: While the form recovers Chinchilla in the appropriate limit and improves RMSE on the tested grids, the central extrapolation claim rests on post-fit performance within or near the fitting range; independent verification on new architectures, data distributions, or substantially larger regimes (beyond the four families and five grids) is needed to support generalization of the decomposition.

    Authors: We agree that broader independent verification would strengthen claims of generalization. The current validation covers multi-epoch runs across four architecture families (MLPs, ResNets, Fourier neural operators, transformers) in vision, scientific ML, and language, plus refits to five published LLM grids. These already exceed the scope of most prior scaling-law studies. We acknowledge, however, that the tested regimes do not include entirely new architectures or substantially larger scales beyond the available published grids. We have added a limitations subsection in the discussion that explicitly notes the current validation scope and outlines the need for future tests on larger regimes and new distributions. revision: partial
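
The split protocol the rebuttal describes can be sketched end-to-end on synthetic data: fit only on cells below a compute threshold, score only on cells above it. The two-parameter grid-search "fit", the synthetic grid, and every constant here are illustrative stand-ins for the paper's nine-parameter procedure.

```python
import itertools, math

E, L0 = 0.1, 2.5

def form(N, T, a, b, alpha=0.3, beta=0.3):
    # Simplified two-term variant for illustration (no overfitting term).
    h = a / N**alpha + b / T**beta
    return E + (L0 - E) * h / (1 + h)

# Synthetic grid generated from a "true" (a, b); real cells would be measured.
true_a, true_b = 50.0, 80.0
cells = [(10.0**n, 10.0**t, form(10.0**n, 10.0**t, true_a, true_b))
         for n in range(4, 9) for t in range(6, 11)]

threshold = 1e13                                  # compute proxy C ~ 6*N*T
fit_cells  = [c for c in cells if 6 * c[0] * c[1] <= threshold]
eval_cells = [c for c in cells if 6 * c[0] * c[1] > threshold]

def rmse(cs, a, b):
    return math.sqrt(sum((form(N, T, a, b) - y) ** 2 for N, T, y in cs) / len(cs))

# Coarse grid-search "fit" on the low-compute cells only.
a_hat, b_hat = min(itertools.product(range(10, 101, 10), repeat=2),
                   key=lambda ab: rmse(fit_cells, *ab))

# Extrapolation score: error on cells the fit never saw. On noiseless
# synthetic data the true parameters are recovered exactly.
assert (a_hat, b_hat) == (50, 80)
assert rmse(eval_cells, a_hat, b_hat) < 1e-9
```

On noiseless synthetic data the split trivially succeeds; the referee's point is that on real grids the held-out cells sit near the fitting range, so the protocol alone does not establish far extrapolation.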

standing simulated objections not resolved
  • Independent verification on new architectures, data distributions, or substantially larger regimes beyond the four families and five grids

Circularity Check

0 steps flagged

No significant circularity in the proposed scaling-law extension

full rationale

The paper proposes an empirical closed-form extension to Chinchilla scaling that is deliberately constructed to recover the original law in the data-rich single-epoch limit and to saturate at an uninformed baseline; this is a design choice, not a derivation that collapses to its inputs. Parameters are fitted to held-out subsets of multi-epoch experiments and published LLM grids, with performance claims tied to extrapolation beyond the fitting range (higher compute and unique data). No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked. The central result is therefore an independent functional ansatz whose predictive advantage is measured on data points outside the fit, satisfying the requirement for self-contained empirical validation rather than any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free-parameter group · 1 axiom · 0 invented entities

The central claim rests on the assumption that loss can be decomposed into three additive regimes via the chosen saturating functional form; this form and its parameters are fitted rather than derived from first principles.

free parameters (1)
  • E, L0, a, b, c, alpha, beta, gamma, delta
    All parameters in the proposed L(N,D,T) equation are fitted to experimental or published data grids.
axioms (1)
  • domain assumption: Loss can be expressed as a saturating function of undercapacity, undertraining, and overfitting terms that reduces to the Chinchilla form in the data-rich single-epoch limit.
    Invoked in the proposal of the closed-form extension and its claimed reduction property.

pith-pipeline@v0.9.0 · 5601 in / 1576 out tokens · 35630 ms · 2026-05-12T02:53:37.878631+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

