Validity Threats for Foundation Model Research

Gunnar K\"onig; Martin Pawelczyk; Sebastian Bordt; Ulrike von Luxburg

arxiv: 2606.05029 · v1 · pith:LOOPTAP4new · submitted 2026-06-03 · 💻 cs.LG · cs.CL

Validity Threats for Foundation Model Research

Gunnar K\"onig , Martin Pawelczyk , Ulrike von Luxburg , Sebastian Bordt This is my paper

Pith reviewed 2026-06-28 06:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords validity threatsfoundation modelscausal inferenceproxy experimentsobservational studiesscaling lawsresearch designmachine learning

0 comments

The pith

Savings in compute for foundation model research come at the cost of validity threats that can invalidate claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that full controlled experiments on foundation models have become too expensive, leading researchers to rely on approximations such as proxy experiments, scaling laws, observational studies with public models, and single-run designs that exploit internal variation. These cheaper strategies save resources but introduce hidden and sometimes untestable assumptions whose violation can undermine research conclusions. The authors frame foundation model research as a causal inference problem and evaluate the strategies using four validity types drawn from empirical social sciences: statistical, internal, external, and construct validity. Each approach shows a distinct profile, for instance proxy experiments gaining statistical and internal validity while losing external and construct validity. The resulting toolkit lets researchers identify which threats apply to their chosen design.

Core claim

Savings in compute come at the cost of validity threats—hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. The evaluation framework casts foundation model research as a causal inference problem and evaluates different research strategies through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. Proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats th

What carries the argument

An evaluation framework that treats foundation model research as a causal inference problem and assesses each research strategy through four validity types (statistical, internal, external, construct) imported from empirical social sciences.

If this is right

Proxy experiments trade external and construct validity for statistical and internal validity.
Observational studies face confounding and effect heterogeneity.
Single-run designs are strained by interference between treated units.
Several validity threats have received insufficient attention in the literature.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be used to score specific published papers and flag which validity threats remain unaddressed in each.
Combining two approximation strategies might offset some weaknesses of each, though the paper does not test this possibility.
Widespread adoption would likely shift publication incentives toward designs that explicitly document their validity profile.

Load-bearing premise

The four validity types imported from empirical social sciences can be applied to foundation-model research strategies without substantial redefinition or loss of diagnostic power.

What would settle it

A side-by-side test in which conclusions drawn from a proxy experiment or observational study on a foundation model are directly compared against results from a full controlled experiment on the identical question; systematic disagreement would support the claimed validity threats while consistent agreement would falsify them.

Figures

Figures reproduced from arXiv: 2606.05029 by Gunnar K\"onig, Martin Pawelczyk, Sebastian Bordt, Ulrike von Luxburg.

**Figure 1.** Figure 1: Foundation model research as a causal inference problem. To understand and improve foundation models, we need to answer questions about how changes to a training recipe affect outcomes of interest—for example, how adding math data affects reasoning benchmark scores. At frontier scale, the ideal experiment that identifies this treatment effect must be approximated. An evaluation framework for foundation mod… view at source ↗

**Figure 2.** Figure 2: The proxy approach. We conduct an experiment but replace one or several parts with proxies ( ) to make it computationally feasible. For example, we conduct the experiment with a proxy model (1B instead of 90B parameters), a proxy treatment (fine-tuning instead of pre-training on math data), or a proxy outcome (validation loss instead of task performance). learning rate are adjusted with scale. This means t… view at source ↗

**Figure 3.** Figure 3: The observational approach. Instead of running new experiments, we can leverage the publicly available meta-data of existing training runs. This data is observational, meaning that we cannot perform treatments and observe their effects. Instead we have to use the statistical relationships between recipe components and outcomes in combination with (untestable) causal assumptions to identify treatment effect… view at source ↗

**Figure 4.** Figure 4: Analyzing a single training run. Single-run analysis emulates multiple training runs (left) from one (right). Here, documents are regarded as independent “units”, their assignment to the training set as the “treatment”, and their respective loss as the “outcome”. Validity threats. • Due to confounding and selection bias, internal validity is a key threat for the observational approach, requiring strong, un… view at source ↗

**Figure 5.** Figure 5: We illustrated the case of confounding in detail in Section 4. As follows we discuss selection bias. While confounding concerns the data-generating process itself, selection bias arises from the process by which observations are included in the sample.3 For example, Singh et al. [92] show that some labs evaluate many model variants on private benchmark data before publishing the best score. Conditioning on… view at source ↗

read the original abstract

Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research~designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper maps common low-compute strategies in foundation model work to four social-science validity threats and gives each strategy a characteristic profile, but the transfer of those validity concepts needs more justification.

read the letter

The main takeaway is that proxy experiments, observational studies on public models, and single-run designs each trade off different kinds of validity when they stand in for full controlled experiments on large foundation models. The paper spells out those trade-offs by adapting statistical, internal, external, and construct validity from empirical social science.

What is new is the explicit assignment of profiles: proxies are said to buy statistical and internal validity at the expense of external and construct validity; observational studies run into confounding and effect heterogeneity; single-run designs face interference between units. The abstract and the described framework make these mappings concrete rather than leaving them as general warnings.

The paper does a clean job of showing there is no free lunch and of naming threats such as interference that get less airtime in current ML discussions. The logic is straightforward and the structure is easy to follow.

The soft spot is the assumption that the four validity types carry over without substantial redefinition. Foundation model training involves scale-dependent emergence and black-box dynamics that social-science definitions were not built for, and the paper does not walk through concrete cases to show the framework catches problems that standard ML critiques would miss. Without that demonstration the profiles risk staying at the level of plausible labels.

This is for researchers who design or review foundation model experiments on tight compute budgets and want a checklist for validity threats. It is worth sending to peer review because the underlying concern is real and timely, even if the framework would benefit from tighter grounding in the FM setting.

Referee Report

1 major / 1 minor

Summary. The manuscript argues that cost-saving approximations to controlled experiments in foundation model research—proxy experiments, observational studies, and single-run designs—introduce validity threats in the form of hidden assumptions that can invalidate claims. It proposes an evaluation framework that recasts foundation model research as a causal inference problem and applies four validity types (statistical, internal, external, and construct) adapted from empirical social sciences to produce characteristic validity profiles for each strategy, revealing several insufficiently addressed threats.

Significance. If the adaptation holds, the framework supplies a practical, structured toolkit for scrutinizing validity in compute-constrained FM research. It explicitly maps strategies to threats such as confounding and effect heterogeneity in observational studies and interference in single-run designs, potentially improving the reliability of claims that rely on these approximations.

major comments (1)

[Abstract] Abstract, paragraph introducing the evaluation framework: the claim that the four validity types transfer from empirical social sciences without substantial redefinition or loss of diagnostic power is load-bearing for the central argument that the framework provides a practical toolkit. The manuscript does not supply a detailed mapping or concrete demonstration that these criteria surface threats beyond those already identified in the ML literature on scaling laws and proxy tasks.

minor comments (1)

[Abstract] The abstract states that the analysis 'reveals several validity threats that have received insufficient attention' but does not enumerate them; adding an explicit list or summary table early in the paper would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the positive assessment of the framework's potential utility. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract, paragraph introducing the evaluation framework: the claim that the four validity types transfer from empirical social sciences without substantial redefinition or loss of diagnostic power is load-bearing for the central argument that the framework provides a practical toolkit. The manuscript does not supply a detailed mapping or concrete demonstration that these criteria surface threats beyond those already identified in the ML literature on scaling laws and proxy tasks.

Authors: We agree that an explicit demonstration of the transfer is important for the claim. The manuscript adapts the four validity definitions directly from the cited social-science sources (e.g., Shadish, Cook & Campbell) without redefinition and then applies them in Sections 3–5 to derive strategy-specific threat profiles. These applications identify threats such as interference between treated units in single-run designs and effect heterogeneity in observational studies of public checkpoints—issues that receive little attention in the scaling-law and proxy-task literature. To make the mapping more transparent and to strengthen the load-bearing claim, we will add a summary table in the revised manuscript that cross-tabulates each validity criterion against the three research strategies, with explicit contrasts to existing ML discussions. revision: yes

Circularity Check

0 steps flagged

No circularity: direct application of external social-science concepts with no self-referential derivations or fitted inputs

full rationale

The paper contains no equations, fitted parameters, or derivations. It proposes an evaluation framework by directly adapting four established validity types (statistical, internal, external, construct) from empirical social sciences to foundation-model research strategies, without any reduction of claims to self-citations, self-definitions, or renamed inputs. The central premise—that compute-saving strategies introduce validity threats—is presented as an application of external concepts rather than a result derived from the paper's own prior outputs. This is the most common honest non-finding for conceptual framework papers that import independent external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on the domain assumption that foundation-model research can be usefully cast as a causal-inference problem and that the four validity types transfer without major adaptation. No free parameters or invented entities are introduced.

axioms (2)

domain assumption Foundation model research strategies can be treated as attempts to perform causal inference.
Stated in the sentence that introduces the evaluation framework.
domain assumption The four validity types from empirical social sciences apply directly to ML research designs.
Invoked when the paper evaluates each strategy through statistical, internal, external, and construct validity.

pith-pipeline@v0.9.1-grok · 5766 in / 1285 out tokens · 20970 ms · 2026-06-28T06:50:23.663030+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

116 extracted references · 49 canonical work pages · 12 internal anchors

[1]

One-shot empirical privacy estimation for federated learning

Galen Andrew, Peter Kairouz, Sewoong Oh, Alina Oprea, H Brendan McMahan, and Vinith M Suriyakumar. One-shot empirical privacy estimation for federated learning. InInternational Conference on Learning Representations, 2024

2024
[2]

If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 35:17953–17967, 2022

Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B Grosse. If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 35:17953–17967, 2022

2022
[3]

Training data attribution via approximate unrolled differentiation.arXiv preprint arXiv:2405.12186, 2024

Juhan Bae, Wu Lin, Jonathan Lorraine, and Roger Grosse. Training data attribution via approximate unrolled differentiation.arXiv preprint arXiv:2405.12186, 2024

work page arXiv 2024
[4]

Measuring what matters: Construct validity in large language model benchmarks

Andrew M Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, et al. Measuring what matters: Construct validity in large language model benchmarks. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[5]

Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv:2505.13738, 2025

Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv:2505.13738, 2025

work page arXiv 2025
[6]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023

2023
[8]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Train once, answer all: Many pretraining experiments for the cost of one

Sebastian Bordt and Martin Pawelczyk. Train once, answer all: Many pretraining experiments for the cost of one. InICLR, 2026

2026
[10]

Data taggants: Dataset owner- ship verification via harmless targeted data poisoning

Wassim Bouaziz, Nicolas Usunier, and El-Mahdi El-Mhamdi. Data taggants: Dataset owner- ship verification via harmless targeted data poisoning. 2025

2025
[11]

Accounting for variance in machine learning benchmarks.Proceedings of machine learning and systems, 2021

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram V oleti, et al. Accounting for variance in machine learning benchmarks.Proceedings of machine learning and systems, 2021

2021
[12]

Samuel Bowman and George Dahl. What will it take to fix benchmarking in natural lan- guage understanding? InConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, 2021

2021
[13]

Language models are few-shot learners.NeurIPS, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.NeurIPS, 2020

2020
[14]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021. 10

2021
[15]

Chatbot arena: An open platform for evaluating llms by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024

2024
[16]

A hitchhiker’s guide to scaling law estimation.arXiv preprint arXiv:2410.11840, 2024

Leshem Choshen, Yang Zhang, and Jacob Andreas. A hitchhiker’s guide to scaling law estimation.arXiv preprint arXiv:2410.11840, 2024

work page arXiv 2024
[17]

Provably robust watermarks for open-source language models.TMLR, 2026

Miranda Christ, Sam Gunn, Tal Malkin, and Mariana Raykova. Provably robust watermarks for open-source language models.TMLR, 2026

2026
[18]

The variety-of-evidence thesis: A bayesian exploration of its surprising failures.Synthese, 2019

François Claveau and Olivier Grenier. The variety-of-evidence thesis: A bayesian exploration of its surprising failures.Synthese, 2019

2019
[19]

The rising costs of training frontier AI models.arXiv preprint arXiv:2405.21015, 2024

Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier ai models.arXiv preprint arXiv:2405.21015, 2024

work page arXiv 2024
[20]

The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

work page arXiv 2021
[21]

Constat: Performance-based contamina- tion detection in large language models.Advances in Neural Information Processing Systems, 37:92420–92464, 2024

Jasper Dekoninck, Mark Müller, and Martin Vechev. Constat: Performance-based contamina- tion detection in large language models.Advances in Neural Information Processing Systems, 37:92420–92464, 2024

2024
[22]

Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster.arXiv preprint arXiv:2304.03208, 2023

Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster.arXiv preprint arXiv:2304.03208, 2023

work page arXiv 2023
[23]

Training on the test task confounds evaluation and emergence.arXiv preprint arXiv:2407.07890, 2024

Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence.arXiv preprint arXiv:2407.07890, 2024

work page arXiv 2024
[24]

Gaussian differential privacy.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022

Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022

2022
[25]

Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

work page arXiv 2025
[26]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

2024
[27]

Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022

work page arXiv 2022
[28]

Data on AI models, 3 2026

Epoch AI. Data on AI models, 3 2026. URL https://epoch.ai/data/ai-models. Ac- cessed: 21 Mar 2026

2026
[29]

What neural networks memorize and why: Discovering the long tail via influence estimation.Advances in Neural Information Processing Systems, 33: 2881–2891, 2020

Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation.Advances in Neural Information Processing Systems, 33: 2881–2891, 2020

2020
[30]

Open llm leaderboard v2

Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard, 2024

2024
[31]

The benchmarking epistemology: Construct validity for evaluating machine learning models.arXiv preprint arXiv:2510.23191, 2025

Timo Freiesleben and Sebastian Zezulka. The benchmarking epistemology: Construct validity for evaluating machine learning models.arXiv preprint arXiv:2510.23191, 2025

work page arXiv 2025
[32]

A swiss army infinitesimal jackknife

Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 1139–1147. PMLR, 2019. 11

2019
[33]

Scaling laws for data filtering–data curation cannot be compute agnostic

Sachin Goyal, Pratyush Maini, Zachary C Lipton, Aditi Raghunathan, and J Zico Kolter. Scaling laws for data filtering–data curation cannot be compute agnostic. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22702–22711, 2024

2024
[34]

Olmo: Accelerating the science of language models

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, 2024

2024
[35]

Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

work page arXiv 2023
[36]

On the surprising effectiveness of large learning rates under standard width scaling.arXiv preprint arXiv:2505.22491, 2025

Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, and Leena Chennuru Vankadara. On the surprising effectiveness of large learning rates under standard width scaling.arXiv preprint arXiv:2505.22491, 2025

work page arXiv 2025
[37]

The influence curve and its role in robust estimation.Journal of the american statistical association, 69(346):383–393, 1974

Frank R Hampel. The influence curve and its role in robust estimation.Journal of the american statistical association, 69(346):383–393, 1974

1974
[38]

The emerging science of machine learning benchmarks.Manuscript

Moritz Hardt. The emerging science of machine learning benchmarks.Manuscript. https://mlbenchmarks.org, 2025

2025
[39]

Using big data to emulate a target trial when a randomized trial is not available.American journal of epidemiology, 183(8):758–764, 2016

Miguel A Hernán and James M Robins. Using big data to emulate a target trial when a randomized trial is not available.American journal of epidemiology, 183(8):758–764, 2016

2016
[40]

Position: Why we must rethink empirical research in machine learning

Moritz Herrmann, F Julian D Lange, Katharina Eggensperger, Giuseppe Casalicchio, Marcel Wever, Matthias Feurer, David Rügamer, Eyke Hüllermeier, Anne-Laure Boulesteix, and Bernd Bischl. Position: Why we must rethink empirical research in machine learning. In International Conference on Machine Learning (ICML), 2024

2024
[41]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[42]

Algorithmic progress in language models

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan C Guo, David Atkinson, Neil Thompson, and Jaime Sevilla. Algorithmic progress in language models. Advances in Neural Information Processing Systems, 37:58245–58283, 2024

2024
[43]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies.arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Jacobs and Hanna Wallach

Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 375–385. ACM, 2021. doi: 10.1145/3442188.3445901

work page doi:10.1145/3442188.3445901 2021
[46]

Measuring forgetting of memorized training examples

Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring forgetting of memorized training examples. InICLR, 2023

2023
[47]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[48]

How well can differential privacy be audited in one run?arXiv preprint arXiv:2503.07199, 2025

Amit Keinan, Moshe Shenfeld, and Katrina Ligett. How well can differential privacy be audited in one run?arXiv preprint arXiv:2503.07199, 2025. 12

work page arXiv 2025
[49]

Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

work page arXiv 2025
[50]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InInternational conference on machine learning, pages 1885–1894. PMLR, 2017

2017
[51]

Predicting llm reasoning performance with small proxy model

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, and Jamin Shin. Predicting llm reasoning performance with small proxy model. 2026

2026
[52]

Revisiting the scaling properties of downstream metrics in large language model training

Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, and Jason Ramapuram. Revisiting the scaling properties of downstream metrics in large language model training. arXiv preprint arXiv:2512.08894, 2025

work page arXiv 2025
[53]

Gaussian membership inference privacy.Advances in Neural Information Processing Systems, 36:73866–73878, 2023

Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Gaussian membership inference privacy.Advances in Neural Information Processing Systems, 36:73866–73878, 2023

2023
[54]

Causal estimation of memorisation profiles.arXiv preprint arXiv:2406.04327, 2024

Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, and Tiago Pimentel. Causal estimation of memorisation profiles.arXiv preprint arXiv:2406.04327, 2024

work page arXiv 2024
[55]

Causal estimation of tokenisation bias.arXiv preprint arXiv:2506.03149, 2025

Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, and Tiago Pimentel. Causal estimation of tokenisation bias.arXiv preprint arXiv:2506.03149, 2025

work page arXiv 2025
[56]

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

Bojie Li. Incompressible knowledge probes: Estimating black-box llm parameter counts via factual capacity, 2026. URLhttps://arxiv.org/abs/2604.24827

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

Active evaluation acquisition for efficient llm benchmarking.arXiv preprint arXiv:2410.05952, 2024

Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active evaluation acquisition for efficient llm benchmarking.arXiv preprint arXiv:2410.05952, 2024

work page arXiv 2024
[58]

Do influence functions work on large language models.arXiv preprint arXiv:2409.19998, 3, 2024

Zhe Li, Wei Zhao, Yige Li, and Jun Sun. Do influence functions work on large language models.arXiv preprint arXiv:2409.19998, 3, 2024

work page arXiv 2024
[59]

Are we learning yet? a meta review of evaluation failures across machine learning

Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

2021
[60]

Not-just- scaling laws: Towards a better understanding of the downstream impact of language model design decisions

Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, et al. Not-just- scaling laws: Towards a better understanding of the downstream impact of language model design decisions. InProceedings of the 2025 Conference on Empirical Methods in Natural Langu...

2025
[61]

Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. InICLR, 2026

2026
[62]

Scaling laws are unreliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885, 2025

Nicholas Lourie, Michael Y Hu, and Kyunghyun Cho. Scaling laws are unreliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885, 2025

work page arXiv 2025
[63]

Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pon- tus Stenetorp, Sharan Narang, and Dieuwke Hupkes

Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pon- tus Stenetorp, Sharan Narang, and Dieuwke Hupkes. Quantifying variance in evaluation benchmarks.arXiv preprint arXiv:2406.10229, 2024

work page arXiv 2024
[64]

Datadecide: How to predict best pretraining data with small experiments.arXiv preprint arXiv:2504.11393, 2025

Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, et al. Datadecide: How to predict best pretraining data with small experiments.arXiv preprint arXiv:2504.11393, 2025

work page arXiv 2025
[65]

TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

Llms on the line: Data determines loss-to-loss scaling laws

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, and Wieland Brendel. Llms on the line: Data determines loss-to-loss scaling laws. InICML, 2025

2025
[67]

Is there "Secret Sauce'' in Large Language Model Development?

Matthias Mertens, Natalia Fischl-Lanzoni, and Neil Thompson. Is there" secret sauce”in large language model development?arXiv preprint arXiv:2602.07238, 2026. 13

work page internal anchor Pith review Pith/arXiv arXiv 2026
[68]

Adding error bars to evals: A statistical approach to language model evaluations

Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640, 2024

work page arXiv 2024
[69]

Distributional training data attribution: What do influence functions sample? 2025

Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E Turner, and Roger Grosse. Distributional training data attribution: What do influence functions sample? 2025

2025
[70]

Scaling data-constrained language models.Advances in Neural Information Processing Systems, 36:50358–50376, 2023

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Alek- sandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models.Advances in Neural Information Processing Systems, 36:50358–50376, 2023

2023
[71]

Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman. Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms. InICLR 2026, 2026

2026
[72]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[73]

How predictable is language model benchmark performance?arXiv preprint arXiv:2401.04757, 2024

David Owen. How predictable is language model benchmark performance?arXiv preprint arXiv:2401.04757, 2024

work page arXiv 2024
[74]

100 instances is all you need: predicting the success of a new llm on unseen data by testing on a few instances.arXiv preprint arXiv:2409.03563, 2024

Lorenzo Pacchiardi, Lucy G Cheke, and José Hernández-Orallo. 100 instances is all you need: predicting the success of a new llm on unseen data by testing on a few instances.arXiv preprint arXiv:2409.03563, 2024

work page arXiv 2024
[75]

Machine unlearning fails to remove data poisoning attacks

Martin Pawelczyk, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, and Seth Neel. Machine unlearning fails to remove data poisoning attacks. InInternational Conference on Learning Representations (ICLR), 2025

2025
[76]

Cambridge university press, 2009

Judea Pearl.Causality. Cambridge university press, 2009

2009
[77]

Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

2019
[78]

tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

work page arXiv 2024
[79]

Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna

Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. AI and the everything in the whole wide world benchmark. InNeural Information Processing Systems(NeurIPS, 2021

2021
[80]

Kochenderfer

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer. BetterBench: Assessing AI benchmarks, uncovering issues, and establishing best practices. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

Showing first 80 references.

[1] [1]

One-shot empirical privacy estimation for federated learning

Galen Andrew, Peter Kairouz, Sewoong Oh, Alina Oprea, H Brendan McMahan, and Vinith M Suriyakumar. One-shot empirical privacy estimation for federated learning. InInternational Conference on Learning Representations, 2024

2024

[2] [2]

If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 35:17953–17967, 2022

Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B Grosse. If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 35:17953–17967, 2022

2022

[3] [3]

Training data attribution via approximate unrolled differentiation.arXiv preprint arXiv:2405.12186, 2024

Juhan Bae, Wu Lin, Jonathan Lorraine, and Roger Grosse. Training data attribution via approximate unrolled differentiation.arXiv preprint arXiv:2405.12186, 2024

work page arXiv 2024

[4] [4]

Measuring what matters: Construct validity in large language model benchmarks

Andrew M Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, et al. Measuring what matters: Construct validity in large language model benchmarks. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[5] [5]

Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv:2505.13738, 2025

Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv:2505.13738, 2025

work page arXiv 2025

[6] [6]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023

2023

[8] [8]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

Train once, answer all: Many pretraining experiments for the cost of one

Sebastian Bordt and Martin Pawelczyk. Train once, answer all: Many pretraining experiments for the cost of one. InICLR, 2026

2026

[10] [10]

Data taggants: Dataset owner- ship verification via harmless targeted data poisoning

Wassim Bouaziz, Nicolas Usunier, and El-Mahdi El-Mhamdi. Data taggants: Dataset owner- ship verification via harmless targeted data poisoning. 2025

2025

[11] [11]

Accounting for variance in machine learning benchmarks.Proceedings of machine learning and systems, 2021

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram V oleti, et al. Accounting for variance in machine learning benchmarks.Proceedings of machine learning and systems, 2021

2021

[12] [12]

Samuel Bowman and George Dahl. What will it take to fix benchmarking in natural lan- guage understanding? InConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, 2021

2021

[13] [13]

Language models are few-shot learners.NeurIPS, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.NeurIPS, 2020

2020

[14] [14]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021. 10

2021

[15] [15]

Chatbot arena: An open platform for evaluating llms by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024

2024

[16] [16]

A hitchhiker’s guide to scaling law estimation.arXiv preprint arXiv:2410.11840, 2024

Leshem Choshen, Yang Zhang, and Jacob Andreas. A hitchhiker’s guide to scaling law estimation.arXiv preprint arXiv:2410.11840, 2024

work page arXiv 2024

[17] [17]

Provably robust watermarks for open-source language models.TMLR, 2026

Miranda Christ, Sam Gunn, Tal Malkin, and Mariana Raykova. Provably robust watermarks for open-source language models.TMLR, 2026

2026

[18] [18]

The variety-of-evidence thesis: A bayesian exploration of its surprising failures.Synthese, 2019

François Claveau and Olivier Grenier. The variety-of-evidence thesis: A bayesian exploration of its surprising failures.Synthese, 2019

2019

[19] [19]

The rising costs of training frontier AI models.arXiv preprint arXiv:2405.21015, 2024

Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier ai models.arXiv preprint arXiv:2405.21015, 2024

work page arXiv 2024

[20] [20]

The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

work page arXiv 2021

[21] [21]

Constat: Performance-based contamina- tion detection in large language models.Advances in Neural Information Processing Systems, 37:92420–92464, 2024

Jasper Dekoninck, Mark Müller, and Martin Vechev. Constat: Performance-based contamina- tion detection in large language models.Advances in Neural Information Processing Systems, 37:92420–92464, 2024

2024

[22] [22]

Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster.arXiv preprint arXiv:2304.03208, 2023

Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster.arXiv preprint arXiv:2304.03208, 2023

work page arXiv 2023

[23] [23]

Training on the test task confounds evaluation and emergence.arXiv preprint arXiv:2407.07890, 2024

Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence.arXiv preprint arXiv:2407.07890, 2024

work page arXiv 2024

[24] [24]

Gaussian differential privacy.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022

Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022

2022

[25] [25]

Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

work page arXiv 2025

[26] [26]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

2024

[27] [27]

Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022

work page arXiv 2022

[28] [28]

Data on AI models, 3 2026

Epoch AI. Data on AI models, 3 2026. URL https://epoch.ai/data/ai-models. Ac- cessed: 21 Mar 2026

2026

[29] [29]

What neural networks memorize and why: Discovering the long tail via influence estimation.Advances in Neural Information Processing Systems, 33: 2881–2891, 2020

Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation.Advances in Neural Information Processing Systems, 33: 2881–2891, 2020

2020

[30] [30]

Open llm leaderboard v2

Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard, 2024

2024

[31] [31]

The benchmarking epistemology: Construct validity for evaluating machine learning models.arXiv preprint arXiv:2510.23191, 2025

Timo Freiesleben and Sebastian Zezulka. The benchmarking epistemology: Construct validity for evaluating machine learning models.arXiv preprint arXiv:2510.23191, 2025

work page arXiv 2025

[32] [32]

A swiss army infinitesimal jackknife

Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 1139–1147. PMLR, 2019. 11

2019

[33] [33]

Scaling laws for data filtering–data curation cannot be compute agnostic

Sachin Goyal, Pratyush Maini, Zachary C Lipton, Aditi Raghunathan, and J Zico Kolter. Scaling laws for data filtering–data curation cannot be compute agnostic. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22702–22711, 2024

2024

[34] [34]

Olmo: Accelerating the science of language models

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, 2024

2024

[35] [35]

Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

work page arXiv 2023

[36] [36]

On the surprising effectiveness of large learning rates under standard width scaling.arXiv preprint arXiv:2505.22491, 2025

Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, and Leena Chennuru Vankadara. On the surprising effectiveness of large learning rates under standard width scaling.arXiv preprint arXiv:2505.22491, 2025

work page arXiv 2025

[37] [37]

The influence curve and its role in robust estimation.Journal of the american statistical association, 69(346):383–393, 1974

Frank R Hampel. The influence curve and its role in robust estimation.Journal of the american statistical association, 69(346):383–393, 1974

1974

[38] [38]

The emerging science of machine learning benchmarks.Manuscript

Moritz Hardt. The emerging science of machine learning benchmarks.Manuscript. https://mlbenchmarks.org, 2025

2025

[39] [39]

Using big data to emulate a target trial when a randomized trial is not available.American journal of epidemiology, 183(8):758–764, 2016

Miguel A Hernán and James M Robins. Using big data to emulate a target trial when a randomized trial is not available.American journal of epidemiology, 183(8):758–764, 2016

2016

[40] [40]

Position: Why we must rethink empirical research in machine learning

Moritz Herrmann, F Julian D Lange, Katharina Eggensperger, Giuseppe Casalicchio, Marcel Wever, Matthias Feurer, David Rügamer, Eyke Hüllermeier, Anne-Laure Boulesteix, and Bernd Bischl. Position: Why we must rethink empirical research in machine learning. In International Conference on Machine Learning (ICML), 2024

2024

[41] [41]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[42] [42]

Algorithmic progress in language models

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan C Guo, David Atkinson, Neil Thompson, and Jaime Sevilla. Algorithmic progress in language models. Advances in Neural Information Processing Systems, 37:58245–58283, 2024

2024

[43] [43]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[44] [44]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies.arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Jacobs and Hanna Wallach

Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 375–385. ACM, 2021. doi: 10.1145/3442188.3445901

work page doi:10.1145/3442188.3445901 2021

[46] [46]

Measuring forgetting of memorized training examples

Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring forgetting of memorized training examples. InICLR, 2023

2023

[47] [47]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[48] [48]

How well can differential privacy be audited in one run?arXiv preprint arXiv:2503.07199, 2025

Amit Keinan, Moshe Shenfeld, and Katrina Ligett. How well can differential privacy be audited in one run?arXiv preprint arXiv:2503.07199, 2025. 12

work page arXiv 2025

[49] [49]

Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

work page arXiv 2025

[50] [50]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InInternational conference on machine learning, pages 1885–1894. PMLR, 2017

2017

[51] [51]

Predicting llm reasoning performance with small proxy model

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, and Jamin Shin. Predicting llm reasoning performance with small proxy model. 2026

2026

[52] [52]

Revisiting the scaling properties of downstream metrics in large language model training

Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, and Jason Ramapuram. Revisiting the scaling properties of downstream metrics in large language model training. arXiv preprint arXiv:2512.08894, 2025

work page arXiv 2025

[53] [53]

Gaussian membership inference privacy.Advances in Neural Information Processing Systems, 36:73866–73878, 2023

Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Gaussian membership inference privacy.Advances in Neural Information Processing Systems, 36:73866–73878, 2023

2023

[54] [54]

Causal estimation of memorisation profiles.arXiv preprint arXiv:2406.04327, 2024

Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, and Tiago Pimentel. Causal estimation of memorisation profiles.arXiv preprint arXiv:2406.04327, 2024

work page arXiv 2024

[55] [55]

Causal estimation of tokenisation bias.arXiv preprint arXiv:2506.03149, 2025

Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, and Tiago Pimentel. Causal estimation of tokenisation bias.arXiv preprint arXiv:2506.03149, 2025

work page arXiv 2025

[56] [56]

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

Bojie Li. Incompressible knowledge probes: Estimating black-box llm parameter counts via factual capacity, 2026. URLhttps://arxiv.org/abs/2604.24827

work page internal anchor Pith review Pith/arXiv arXiv 2026

[57] [57]

Active evaluation acquisition for efficient llm benchmarking.arXiv preprint arXiv:2410.05952, 2024

Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active evaluation acquisition for efficient llm benchmarking.arXiv preprint arXiv:2410.05952, 2024

work page arXiv 2024

[58] [58]

Do influence functions work on large language models.arXiv preprint arXiv:2409.19998, 3, 2024

Zhe Li, Wei Zhao, Yige Li, and Jun Sun. Do influence functions work on large language models.arXiv preprint arXiv:2409.19998, 3, 2024

work page arXiv 2024

[59] [59]

Are we learning yet? a meta review of evaluation failures across machine learning

Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

2021

[60] [60]

Not-just- scaling laws: Towards a better understanding of the downstream impact of language model design decisions

Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, et al. Not-just- scaling laws: Towards a better understanding of the downstream impact of language model design decisions. InProceedings of the 2025 Conference on Empirical Methods in Natural Langu...

2025

[61] [61]

Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. InICLR, 2026

2026

[62] [62]

Scaling laws are unreliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885, 2025

Nicholas Lourie, Michael Y Hu, and Kyunghyun Cho. Scaling laws are unreliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885, 2025

work page arXiv 2025

[63] [63]

Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pon- tus Stenetorp, Sharan Narang, and Dieuwke Hupkes

Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pon- tus Stenetorp, Sharan Narang, and Dieuwke Hupkes. Quantifying variance in evaluation benchmarks.arXiv preprint arXiv:2406.10229, 2024

work page arXiv 2024

[64] [64]

Datadecide: How to predict best pretraining data with small experiments.arXiv preprint arXiv:2504.11393, 2025

Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, et al. Datadecide: How to predict best pretraining data with small experiments.arXiv preprint arXiv:2504.11393, 2025

work page arXiv 2025

[65] [65]

TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[66] [66]

Llms on the line: Data determines loss-to-loss scaling laws

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, and Wieland Brendel. Llms on the line: Data determines loss-to-loss scaling laws. InICML, 2025

2025

[67] [67]

Is there "Secret Sauce'' in Large Language Model Development?

Matthias Mertens, Natalia Fischl-Lanzoni, and Neil Thompson. Is there" secret sauce”in large language model development?arXiv preprint arXiv:2602.07238, 2026. 13

work page internal anchor Pith review Pith/arXiv arXiv 2026

[68] [68]

Adding error bars to evals: A statistical approach to language model evaluations

Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640, 2024

work page arXiv 2024

[69] [69]

Distributional training data attribution: What do influence functions sample? 2025

Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E Turner, and Roger Grosse. Distributional training data attribution: What do influence functions sample? 2025

2025

[70] [70]

Scaling data-constrained language models.Advances in Neural Information Processing Systems, 36:50358–50376, 2023

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Alek- sandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models.Advances in Neural Information Processing Systems, 36:50358–50376, 2023

2023

[71] [71]

Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman. Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms. InICLR 2026, 2026

2026

[72] [72]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[73] [73]

How predictable is language model benchmark performance?arXiv preprint arXiv:2401.04757, 2024

David Owen. How predictable is language model benchmark performance?arXiv preprint arXiv:2401.04757, 2024

work page arXiv 2024

[74] [74]

100 instances is all you need: predicting the success of a new llm on unseen data by testing on a few instances.arXiv preprint arXiv:2409.03563, 2024

Lorenzo Pacchiardi, Lucy G Cheke, and José Hernández-Orallo. 100 instances is all you need: predicting the success of a new llm on unseen data by testing on a few instances.arXiv preprint arXiv:2409.03563, 2024

work page arXiv 2024

[75] [75]

Machine unlearning fails to remove data poisoning attacks

Martin Pawelczyk, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, and Seth Neel. Machine unlearning fails to remove data poisoning attacks. InInternational Conference on Learning Representations (ICLR), 2025

2025

[76] [76]

Cambridge university press, 2009

Judea Pearl.Causality. Cambridge university press, 2009

2009

[77] [77]

Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

2019

[78] [78]

tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

work page arXiv 2024

[79] [79]

Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna

Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. AI and the everything in the whole wide world benchmark. InNeural Information Processing Systems(NeurIPS, 2021

2021

[80] [80]

Kochenderfer

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer. BetterBench: Assessing AI benchmarks, uncovering issues, and establishing best practices. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024