pith. sign in

arxiv: 2606.05029 · v1 · pith:LOOPTAP4new · submitted 2026-06-03 · 💻 cs.LG · cs.CL

Validity Threats for Foundation Model Research

Pith reviewed 2026-06-28 06:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords validity threatsfoundation modelscausal inferenceproxy experimentsobservational studiesscaling lawsresearch designmachine learning
0
0 comments X

The pith

Savings in compute for foundation model research come at the cost of validity threats that can invalidate claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that full controlled experiments on foundation models have become too expensive, leading researchers to rely on approximations such as proxy experiments, scaling laws, observational studies with public models, and single-run designs that exploit internal variation. These cheaper strategies save resources but introduce hidden and sometimes untestable assumptions whose violation can undermine research conclusions. The authors frame foundation model research as a causal inference problem and evaluate the strategies using four validity types drawn from empirical social sciences: statistical, internal, external, and construct validity. Each approach shows a distinct profile, for instance proxy experiments gaining statistical and internal validity while losing external and construct validity. The resulting toolkit lets researchers identify which threats apply to their chosen design.

Core claim

Savings in compute come at the cost of validity threats—hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. The evaluation framework casts foundation model research as a causal inference problem and evaluates different research strategies through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. Proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats th

What carries the argument

An evaluation framework that treats foundation model research as a causal inference problem and assesses each research strategy through four validity types (statistical, internal, external, construct) imported from empirical social sciences.

If this is right

  • Proxy experiments trade external and construct validity for statistical and internal validity.
  • Observational studies face confounding and effect heterogeneity.
  • Single-run designs are strained by interference between treated units.
  • Several validity threats have received insufficient attention in the literature.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be used to score specific published papers and flag which validity threats remain unaddressed in each.
  • Combining two approximation strategies might offset some weaknesses of each, though the paper does not test this possibility.
  • Widespread adoption would likely shift publication incentives toward designs that explicitly document their validity profile.

Load-bearing premise

The four validity types imported from empirical social sciences can be applied to foundation-model research strategies without substantial redefinition or loss of diagnostic power.

What would settle it

A side-by-side test in which conclusions drawn from a proxy experiment or observational study on a foundation model are directly compared against results from a full controlled experiment on the identical question; systematic disagreement would support the claimed validity threats while consistent agreement would falsify them.

Figures

Figures reproduced from arXiv: 2606.05029 by Gunnar K\"onig, Martin Pawelczyk, Sebastian Bordt, Ulrike von Luxburg.

Figure 1
Figure 1. Figure 1: Foundation model research as a causal inference problem. To understand and improve foundation models, we need to answer questions about how changes to a training recipe affect outcomes of interest—for example, how adding math data affects reasoning benchmark scores. At frontier scale, the ideal experiment that identifies this treatment effect must be approximated. An evaluation framework for foundation mod… view at source ↗
Figure 2
Figure 2. Figure 2: The proxy approach. We conduct an experiment but replace one or several parts with proxies ( ) to make it computationally feasible. For example, we conduct the experiment with a proxy model (1B instead of 90B parameters), a proxy treatment (fine-tuning instead of pre-training on math data), or a proxy outcome (validation loss instead of task performance). learning rate are adjusted with scale. This means t… view at source ↗
Figure 3
Figure 3. Figure 3: The observational approach. Instead of running new experiments, we can leverage the publicly available meta-data of existing training runs. This data is observational, meaning that we cannot perform treatments and observe their effects. Instead we have to use the statistical relationships between recipe components and outcomes in combination with (untestable) causal assumptions to identify treatment effect… view at source ↗
Figure 4
Figure 4. Figure 4: Analyzing a single training run. Single-run analysis emulates multiple training runs (left) from one (right). Here, documents are regarded as independent “units”, their assignment to the training set as the “treatment”, and their respective loss as the “outcome”. Validity threats. • Due to confounding and selection bias, internal validity is a key threat for the observational approach, requiring strong, un… view at source ↗
Figure 5
Figure 5. Figure 5: We illustrated the case of confounding in detail in Section 4. As follows we discuss selection bias. While confounding concerns the data-generating process itself, selection bias arises from the process by which observations are included in the sample.3 For example, Singh et al. [92] show that some labs evaluate many model variants on private benchmark data before publishing the best score. Conditioning on… view at source ↗
read the original abstract

Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research~designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript argues that cost-saving approximations to controlled experiments in foundation model research—proxy experiments, observational studies, and single-run designs—introduce validity threats in the form of hidden assumptions that can invalidate claims. It proposes an evaluation framework that recasts foundation model research as a causal inference problem and applies four validity types (statistical, internal, external, and construct) adapted from empirical social sciences to produce characteristic validity profiles for each strategy, revealing several insufficiently addressed threats.

Significance. If the adaptation holds, the framework supplies a practical, structured toolkit for scrutinizing validity in compute-constrained FM research. It explicitly maps strategies to threats such as confounding and effect heterogeneity in observational studies and interference in single-run designs, potentially improving the reliability of claims that rely on these approximations.

major comments (1)
  1. [Abstract] Abstract, paragraph introducing the evaluation framework: the claim that the four validity types transfer from empirical social sciences without substantial redefinition or loss of diagnostic power is load-bearing for the central argument that the framework provides a practical toolkit. The manuscript does not supply a detailed mapping or concrete demonstration that these criteria surface threats beyond those already identified in the ML literature on scaling laws and proxy tasks.
minor comments (1)
  1. [Abstract] The abstract states that the analysis 'reveals several validity threats that have received insufficient attention' but does not enumerate them; adding an explicit list or summary table early in the paper would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the positive assessment of the framework's potential utility. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract, paragraph introducing the evaluation framework: the claim that the four validity types transfer from empirical social sciences without substantial redefinition or loss of diagnostic power is load-bearing for the central argument that the framework provides a practical toolkit. The manuscript does not supply a detailed mapping or concrete demonstration that these criteria surface threats beyond those already identified in the ML literature on scaling laws and proxy tasks.

    Authors: We agree that an explicit demonstration of the transfer is important for the claim. The manuscript adapts the four validity definitions directly from the cited social-science sources (e.g., Shadish, Cook & Campbell) without redefinition and then applies them in Sections 3–5 to derive strategy-specific threat profiles. These applications identify threats such as interference between treated units in single-run designs and effect heterogeneity in observational studies of public checkpoints—issues that receive little attention in the scaling-law and proxy-task literature. To make the mapping more transparent and to strengthen the load-bearing claim, we will add a summary table in the revised manuscript that cross-tabulates each validity criterion against the three research strategies, with explicit contrasts to existing ML discussions. revision: yes

Circularity Check

0 steps flagged

No circularity: direct application of external social-science concepts with no self-referential derivations or fitted inputs

full rationale

The paper contains no equations, fitted parameters, or derivations. It proposes an evaluation framework by directly adapting four established validity types (statistical, internal, external, construct) from empirical social sciences to foundation-model research strategies, without any reduction of claims to self-citations, self-definitions, or renamed inputs. The central premise—that compute-saving strategies introduce validity threats—is presented as an application of external concepts rather than a result derived from the paper's own prior outputs. This is the most common honest non-finding for conceptual framework papers that import independent external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on the domain assumption that foundation-model research can be usefully cast as a causal-inference problem and that the four validity types transfer without major adaptation. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Foundation model research strategies can be treated as attempts to perform causal inference.
    Stated in the sentence that introduces the evaluation framework.
  • domain assumption The four validity types from empirical social sciences apply directly to ML research designs.
    Invoked when the paper evaluates each strategy through statistical, internal, external, and construct validity.

pith-pipeline@v0.9.1-grok · 5766 in / 1285 out tokens · 20970 ms · 2026-06-28T06:50:23.663030+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

116 extracted references · 49 canonical work pages · 12 internal anchors

  1. [1]

    One-shot empirical privacy estimation for federated learning

    Galen Andrew, Peter Kairouz, Sewoong Oh, Alina Oprea, H Brendan McMahan, and Vinith M Suriyakumar. One-shot empirical privacy estimation for federated learning. InInternational Conference on Learning Representations, 2024

  2. [2]

    If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 35:17953–17967, 2022

    Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B Grosse. If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 35:17953–17967, 2022

  3. [3]

    Training data attribution via approximate unrolled differentiation.arXiv preprint arXiv:2405.12186, 2024

    Juhan Bae, Wu Lin, Jonathan Lorraine, and Roger Grosse. Training data attribution via approximate unrolled differentiation.arXiv preprint arXiv:2405.12186, 2024

  4. [4]

    Measuring what matters: Construct validity in large language model benchmarks

    Andrew M Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, et al. Measuring what matters: Construct validity in large language model benchmarks. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  5. [5]

    Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv:2505.13738, 2025

    Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv:2505.13738, 2025

  6. [6]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

  7. [7]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023

  8. [8]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  9. [9]

    Train once, answer all: Many pretraining experiments for the cost of one

    Sebastian Bordt and Martin Pawelczyk. Train once, answer all: Many pretraining experiments for the cost of one. InICLR, 2026

  10. [10]

    Data taggants: Dataset owner- ship verification via harmless targeted data poisoning

    Wassim Bouaziz, Nicolas Usunier, and El-Mahdi El-Mhamdi. Data taggants: Dataset owner- ship verification via harmless targeted data poisoning. 2025

  11. [11]

    Accounting for variance in machine learning benchmarks.Proceedings of machine learning and systems, 2021

    Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram V oleti, et al. Accounting for variance in machine learning benchmarks.Proceedings of machine learning and systems, 2021

  12. [12]

    Samuel Bowman and George Dahl. What will it take to fix benchmarking in natural lan- guage understanding? InConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, 2021

  13. [13]

    Language models are few-shot learners.NeurIPS, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.NeurIPS, 2020

  14. [14]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021. 10

  15. [15]

    Chatbot arena: An open platform for evaluating llms by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024

  16. [16]

    A hitchhiker’s guide to scaling law estimation.arXiv preprint arXiv:2410.11840, 2024

    Leshem Choshen, Yang Zhang, and Jacob Andreas. A hitchhiker’s guide to scaling law estimation.arXiv preprint arXiv:2410.11840, 2024

  17. [17]

    Provably robust watermarks for open-source language models.TMLR, 2026

    Miranda Christ, Sam Gunn, Tal Malkin, and Mariana Raykova. Provably robust watermarks for open-source language models.TMLR, 2026

  18. [18]

    The variety-of-evidence thesis: A bayesian exploration of its surprising failures.Synthese, 2019

    François Claveau and Olivier Grenier. The variety-of-evidence thesis: A bayesian exploration of its surprising failures.Synthese, 2019

  19. [19]

    The rising costs of training frontier AI models.arXiv preprint arXiv:2405.21015, 2024

    Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier ai models.arXiv preprint arXiv:2405.21015, 2024

  20. [20]

    The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

    Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. The benchmark lottery.arXiv preprint arXiv:2107.07002, 2021

  21. [21]

    Constat: Performance-based contamina- tion detection in large language models.Advances in Neural Information Processing Systems, 37:92420–92464, 2024

    Jasper Dekoninck, Mark Müller, and Martin Vechev. Constat: Performance-based contamina- tion detection in large language models.Advances in Neural Information Processing Systems, 37:92420–92464, 2024

  22. [22]

    Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster.arXiv preprint arXiv:2304.03208, 2023

    Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster.arXiv preprint arXiv:2304.03208, 2023

  23. [23]

    Training on the test task confounds evaluation and emergence.arXiv preprint arXiv:2407.07890, 2024

    Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence.arXiv preprint arXiv:2407.07890, 2024

  24. [24]

    Gaussian differential privacy.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022

    Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022

  25. [25]

    Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

    Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

  26. [26]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  27. [27]

    Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022

  28. [28]

    Data on AI models, 3 2026

    Epoch AI. Data on AI models, 3 2026. URL https://epoch.ai/data/ai-models. Ac- cessed: 21 Mar 2026

  29. [29]

    What neural networks memorize and why: Discovering the long tail via influence estimation.Advances in Neural Information Processing Systems, 33: 2881–2891, 2020

    Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation.Advances in Neural Information Processing Systems, 33: 2881–2891, 2020

  30. [30]

    Open llm leaderboard v2

    Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard, 2024

  31. [31]

    The benchmarking epistemology: Construct validity for evaluating machine learning models.arXiv preprint arXiv:2510.23191, 2025

    Timo Freiesleben and Sebastian Zezulka. The benchmarking epistemology: Construct validity for evaluating machine learning models.arXiv preprint arXiv:2510.23191, 2025

  32. [32]

    A swiss army infinitesimal jackknife

    Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 1139–1147. PMLR, 2019. 11

  33. [33]

    Scaling laws for data filtering–data curation cannot be compute agnostic

    Sachin Goyal, Pratyush Maini, Zachary C Lipton, Aditi Raghunathan, and J Zico Kolter. Scaling laws for data filtering–data curation cannot be compute agnostic. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22702–22711, 2024

  34. [34]

    Olmo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, 2024

  35. [35]

    Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

    Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

  36. [36]

    On the surprising effectiveness of large learning rates under standard width scaling.arXiv preprint arXiv:2505.22491, 2025

    Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, and Leena Chennuru Vankadara. On the surprising effectiveness of large learning rates under standard width scaling.arXiv preprint arXiv:2505.22491, 2025

  37. [37]

    The influence curve and its role in robust estimation.Journal of the american statistical association, 69(346):383–393, 1974

    Frank R Hampel. The influence curve and its role in robust estimation.Journal of the american statistical association, 69(346):383–393, 1974

  38. [38]

    The emerging science of machine learning benchmarks.Manuscript

    Moritz Hardt. The emerging science of machine learning benchmarks.Manuscript. https://mlbenchmarks.org, 2025

  39. [39]

    Using big data to emulate a target trial when a randomized trial is not available.American journal of epidemiology, 183(8):758–764, 2016

    Miguel A Hernán and James M Robins. Using big data to emulate a target trial when a randomized trial is not available.American journal of epidemiology, 183(8):758–764, 2016

  40. [40]

    Position: Why we must rethink empirical research in machine learning

    Moritz Herrmann, F Julian D Lange, Katharina Eggensperger, Giuseppe Casalicchio, Marcel Wever, Matthias Feurer, David Rügamer, Eyke Hüllermeier, Anne-Laure Boulesteix, and Bernd Bischl. Position: Why we must rethink empirical research in machine learning. In International Conference on Machine Learning (ICML), 2024

  41. [41]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

  42. [42]

    Algorithmic progress in language models

    Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan C Guo, David Atkinson, Neil Thompson, and Jaime Sevilla. Algorithmic progress in language models. Advances in Neural Information Processing Systems, 37:58245–58283, 2024

  43. [43]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  44. [44]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies.arXiv preprint arXiv:2404.06395, 2024

  45. [45]

    Jacobs and Hanna Wallach

    Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 375–385. ACM, 2021. doi: 10.1145/3442188.3445901

  46. [46]

    Measuring forgetting of memorized training examples

    Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring forgetting of memorized training examples. InICLR, 2023

  47. [47]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  48. [48]

    How well can differential privacy be audited in one run?arXiv preprint arXiv:2503.07199, 2025

    Amit Keinan, Moshe Shenfeld, and Katrina Ligett. How well can differential privacy be audited in one run?arXiv preprint arXiv:2503.07199, 2025. 12

  49. [49]

    Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

    Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

  50. [50]

    Understanding black-box predictions via influence functions

    Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InInternational conference on machine learning, pages 1885–1894. PMLR, 2017

  51. [51]

    Predicting llm reasoning performance with small proxy model

    Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, and Jamin Shin. Predicting llm reasoning performance with small proxy model. 2026

  52. [52]

    Revisiting the scaling properties of downstream metrics in large language model training

    Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, and Jason Ramapuram. Revisiting the scaling properties of downstream metrics in large language model training. arXiv preprint arXiv:2512.08894, 2025

  53. [53]

    Gaussian membership inference privacy.Advances in Neural Information Processing Systems, 36:73866–73878, 2023

    Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Gaussian membership inference privacy.Advances in Neural Information Processing Systems, 36:73866–73878, 2023

  54. [54]

    Causal estimation of memorisation profiles.arXiv preprint arXiv:2406.04327, 2024

    Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, and Tiago Pimentel. Causal estimation of memorisation profiles.arXiv preprint arXiv:2406.04327, 2024

  55. [55]

    Causal estimation of tokenisation bias.arXiv preprint arXiv:2506.03149, 2025

    Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, and Tiago Pimentel. Causal estimation of tokenisation bias.arXiv preprint arXiv:2506.03149, 2025

  56. [56]

    Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

    Bojie Li. Incompressible knowledge probes: Estimating black-box llm parameter counts via factual capacity, 2026. URLhttps://arxiv.org/abs/2604.24827

  57. [57]

    Active evaluation acquisition for efficient llm benchmarking.arXiv preprint arXiv:2410.05952, 2024

    Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active evaluation acquisition for efficient llm benchmarking.arXiv preprint arXiv:2410.05952, 2024

  58. [58]

    Do influence functions work on large language models.arXiv preprint arXiv:2409.19998, 3, 2024

    Zhe Li, Wei Zhao, Yige Li, and Jun Sun. Do influence functions work on large language models.arXiv preprint arXiv:2409.19998, 3, 2024

  59. [59]

    Are we learning yet? a meta review of evaluation failures across machine learning

    Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  60. [60]

    Not-just- scaling laws: Towards a better understanding of the downstream impact of language model design decisions

    Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, et al. Not-just- scaling laws: Towards a better understanding of the downstream impact of language model design decisions. InProceedings of the 2025 Conference on Empirical Methods in Natural Langu...

  61. [61]

    Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

    Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. InICLR, 2026

  62. [62]

    Scaling laws are unreliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885, 2025

    Nicholas Lourie, Michael Y Hu, and Kyunghyun Cho. Scaling laws are unreliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885, 2025

  63. [63]

    Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pon- tus Stenetorp, Sharan Narang, and Dieuwke Hupkes

    Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pon- tus Stenetorp, Sharan Narang, and Dieuwke Hupkes. Quantifying variance in evaluation benchmarks.arXiv preprint arXiv:2406.10229, 2024

  64. [64]

    Datadecide: How to predict best pretraining data with small experiments.arXiv preprint arXiv:2504.11393, 2025

    Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, et al. Datadecide: How to predict best pretraining data with small experiments.arXiv preprint arXiv:2504.11393, 2025

  65. [65]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

  66. [66]

    Llms on the line: Data determines loss-to-loss scaling laws

    Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, and Wieland Brendel. Llms on the line: Data determines loss-to-loss scaling laws. InICML, 2025

  67. [67]

    Is there "Secret Sauce'' in Large Language Model Development?

    Matthias Mertens, Natalia Fischl-Lanzoni, and Neil Thompson. Is there" secret sauce”in large language model development?arXiv preprint arXiv:2602.07238, 2026. 13

  68. [68]

    Adding error bars to evals: A statistical approach to language model evaluations

    Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640, 2024

  69. [69]

    Distributional training data attribution: What do influence functions sample? 2025

    Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E Turner, and Roger Grosse. Distributional training data attribution: What do influence functions sample? 2025

  70. [70]

    Scaling data-constrained language models.Advances in Neural Information Processing Systems, 36:50358–50376, 2023

    Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Alek- sandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models.Advances in Neural Information Processing Systems, 36:50358–50376, 2023

  71. [71]

    Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms

    Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman. Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms. InICLR 2026, 2026

  72. [72]

    2 OLMo 2 Furious

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

  73. [73]

    How predictable is language model benchmark performance?arXiv preprint arXiv:2401.04757, 2024

    David Owen. How predictable is language model benchmark performance?arXiv preprint arXiv:2401.04757, 2024

  74. [74]

    100 instances is all you need: predicting the success of a new llm on unseen data by testing on a few instances.arXiv preprint arXiv:2409.03563, 2024

    Lorenzo Pacchiardi, Lucy G Cheke, and José Hernández-Orallo. 100 instances is all you need: predicting the success of a new llm on unseen data by testing on a few instances.arXiv preprint arXiv:2409.03563, 2024

  75. [75]

    Machine unlearning fails to remove data poisoning attacks

    Martin Pawelczyk, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, and Seth Neel. Machine unlearning fails to remove data poisoning attacks. InInternational Conference on Learning Representations (ICLR), 2025

  76. [76]

    Cambridge university press, 2009

    Judea Pearl.Causality. Cambridge university press, 2009

  77. [77]

    Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

    Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

  78. [78]

    tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

  79. [79]

    Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna

    Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. AI and the everything in the whole wide world benchmark. InNeural Information Processing Systems(NeurIPS, 2021

  80. [80]

    Kochenderfer

    Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer. BetterBench: Assessing AI benchmarks, uncovering issues, and establishing best practices. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

Showing first 80 references.