pith. machine review for the scientific record.

arxiv: 2304.01373 · v2 · submitted 2023-04-03 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · training dynamics · model scaling · checkpoints · memorization · gender bias · few-shot learning · public datasets

The pith

A suite of 16 language models trained on identical public data in the same order from 70M to 12B parameters enables direct tracking of how abilities emerge during training and across scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Pythia as a controlled collection of 16 large language models trained on the exact same sequence of public data. The models span sizes from 70 million to 12 billion parameters, and the authors release 154 checkpoints per model plus tools to reconstruct the precise training data loaders. This setup supports studies of how model behaviors develop step by step through training and how those patterns shift as size grows. The authors demonstrate its value with case studies on memorization patterns, how word frequency influences few-shot performance, and methods for lowering gender bias. By making the full training history public, the suite aims to make research on LLM training dynamics more reproducible.
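Because every model consumes the same token stream in the same fixed order, an optimizer step maps deterministically to a window of that stream; this is the arithmetic the released dataloader tools rely on. A minimal sketch, assuming the paper's configuration of 1024 sequences of 2048 tokens per batch (treat those constants as assumptions here):

```python
def tokens_seen(step, batch_sequences=1024, seq_len=2048):
    """Cumulative training tokens consumed after `step` optimizer steps,
    under a fixed batch of `batch_sequences` sequences of `seq_len` tokens."""
    return step * batch_sequences * seq_len

def batch_token_range(step, batch_sequences=1024, seq_len=2048):
    """Half-open [start, end) index range into the flattened, fixed-order
    token stream that batch number `step` covers."""
    per_step = batch_sequences * seq_len
    return step * per_step, (step + 1) * per_step
```

With this mapping, any released checkpoint can be paired with exactly the tokens the model had, and had not yet, seen at that point in training.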

Core claim

We introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics.
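The 154 checkpoints per model can be enumerated from a plausible release schedule: log-spaced early steps to capture fast-moving initial dynamics, then evenly spaced steps through the end of training. The exact steps below (step 0, powers of two up to 512, then every 1000 steps to 143000) are an assumption consistent with the stated count, not a quote from the paper:

```python
def pythia_checkpoint_steps():
    """Enumerate an assumed 154-checkpoint schedule per model:
    step 0, log-spaced steps 1, 2, 4, ..., 512, then every
    1000 steps up to a final step of 143000."""
    early = [0] + [2 ** i for i in range(10)]   # 0, 1, 2, ..., 512  (11 steps)
    later = list(range(1000, 143001, 1000))     # 1000, ..., 143000  (143 steps)
    return early + later
```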

What carries the argument

The Pythia suite of identically ordered training runs across scales, together with full checkpoint releases and data-loader reconstruction tools, allows side-by-side comparison of training trajectories.

If this is right

  • Memorization can be measured as a function of training step and model size under fixed data conditions.
  • Effects of term frequency on few-shot accuracy become isolatable across the full range of scales.
  • Targeted interventions for reducing gender bias can be tested while holding data order and content constant.
  • Scaling trends in capability emergence can be examined with finer temporal resolution than single-final-checkpoint studies allow.
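The first bullet reduces to a concrete check: prompt with the first k tokens of a training sequence and test whether greedy decoding reproduces the next k (k = 32 in the paper's setup). A sketch with the model stubbed as any prompt-to-continuation callable:

```python
def is_memorized(greedy_continue, sequence, k=32):
    """Return True if greedily continuing the first k tokens of
    `sequence` exactly reproduces its next k tokens.

    `greedy_continue(prompt, n)` is any callable returning n greedily
    decoded tokens; k=32 follows the paper's memorization metric.
    """
    prompt, target = sequence[:k], sequence[k:2 * k]
    return greedy_continue(prompt, k) == target

# Stub "model" that has perfectly memorized an arithmetic progression:
seq = list(range(64))
copying_model = lambda prompt, n: list(range(prompt[-1] + 1, prompt[-1] + 1 + n))
assert is_memorized(copying_model, seq)
```

Sweeping this check over checkpoints and model sizes yields memorization as a function of training step and scale under fixed data conditions.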

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The controlled setup could serve as a reference point for testing whether observed training dynamics hold under different data orders or sources.
  • Researchers might now run targeted ablations on individual training stages by restarting from specific checkpoints.
  • The public data loaders open the possibility of quantifying how much variance in LLM behavior is attributable to data sequence versus model size alone.

Load-bearing premise

That training all models on the exact same public data in identical order, combined with released checkpoints, will produce reproducible and generalizable insights into training dynamics without major unaccounted confounding from data selection or implementation details.

What would settle it

Independent replication: if labs using the released checkpoints and data loaders obtain consistent results on the memorization and bias case studies while varying only random seeds or minor implementation details, the premise holds; inconsistent results under those minimal perturbations would undermine it.
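Such a replication test comes down to comparing per-checkpoint metric curves (e.g., memorization rate by step) from independent reruns. A minimal sketch; the function names and the tolerance are illustrative, not from the paper:

```python
def max_divergence(curve_a, curve_b):
    """Largest absolute gap between two equal-length per-checkpoint
    metric curves from independent reruns."""
    assert len(curve_a) == len(curve_b)
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))

def replicates(curve_a, curve_b, tol=0.01):
    """Illustrative criterion: reruns agree if the curves never diverge
    by more than `tol` -- the threshold itself is a judgment call."""
    return max_divergence(curve_a, curve_b) <= tol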

read the original abstract

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Pythia, a suite of 16 LLMs (70M–12B parameters) trained on identical public data in fixed order, releasing 154 checkpoints per model plus dataloader reconstruction tools. Case studies illustrate applications to memorization, term-frequency effects on few-shot performance, and gender-bias reduction, arguing that the controlled setup yields novel insights into training dynamics and scaling.

Significance. If the artifacts match the description, the contribution is significant: a public, reproducible resource that removes data-order confounding for studies of LLM training. The explicit release of models, checkpoints, training code, and dataloader tools directly supports reproducible research on emergence and scaling, a clear strength for the field.

minor comments (3)
  1. [Abstract] The phrase 'novel results' on memorization and bias would be strengthened by a single quantitative comparison (e.g., delta in memorization rate versus prior work) rather than qualitative assertion.
  2. [Section 3] The exact tokenization and packing details for the shared dataloader should be stated explicitly so that independent re-implementations can match the released checkpoints without ambiguity.
  3. [Figure 2] Axis labels and legend entries are too small for print; increasing font size would improve readability of the scaling curves.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the Pythia suite and recognition of its significance for reproducible research on training dynamics. No major comments were raised; the three minor points (a quantitative memorization comparison, explicit tokenization and packing details, and figure legibility) will be addressed in revision.

Circularity Check

0 steps flagged

No significant circularity; contribution is artifact release

full rationale

The paper's core claim is the introduction and public release of 16 models (70M–12B) trained on identical public data in fixed order, with 154 checkpoints each and dataloader reconstruction tools. No derivation chain, equations, or first-principles predictions are present; the work consists of controlled empirical artifacts and case-study demonstrations. These do not reduce to fitted parameters, self-citations, or ansatzes by construction. The setup is self-contained against external benchmarks via public data and code, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that identical data ordering across model sizes removes confounding variables; no free parameters or new entities are introduced beyond the model suite itself.

axioms (1)
  • domain assumption: Large language models trained on the same public data in identical order will produce comparable and analyzable training trajectories across scales.
    Invoked to justify the utility of the shared training setup for studying dynamics.

pith-pipeline@v0.9.0 · 5523 in / 1280 out tokens · 54916 ms · 2026-05-15T17:39:56.487514+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  2. Theoretical Limits of Language Model Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  3. Layer Collapse in Diffusion Language Models

    cs.LG 2026-05 conditional novelty 7.0

    Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...

  4. Layer Collapse in Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

  5. DDO-RM: Distribution-Level Policy Improvement after Reward Learning

    stat.ML 2026-04 unverdicted novelty 7.0

    DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.

  6. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  7. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  8. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

  9. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  10. Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency

    cs.LG 2026-05 unverdicted novelty 6.0

    A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.

  11. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  12. Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

    cs.LG 2026-04 unverdicted novelty 6.0

    A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

  13. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

  14. Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in ex...

  15. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  16. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  17. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    cs.CL 2023-06 unverdicted novelty 6.0

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  18. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    cs.CL 2023-05 conditional novelty 6.0

    UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

  19. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

  20. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  21. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

198 extracted references · 198 canonical work pages · cited by 19 Pith papers · 25 internal anchors

  1. [1]

    J., Berenberg, D., Fisk, I., Zanichelli, N., Zhang, B., et al

    Ahdritz, G., Bouatta, N., Kadyan, S., Xia, Q., Gerecke, W., O'Donnell, T. J., Berenberg, D., Fisk, I., Zanichelli, N., Zhang, B., et al. Openfold: Retraining alphafold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv, 2022

  2. [2]

    GPT-NeoX : Large scale autoregressive language modeling in PyTorch , 8 2021

    Andonian, A., Anthony, Q., Biderman, S., Black, S., Gali, P., Gao, L., Hallahan, E., Levy-Kramer, J., Leahy, C., Nestler, L., Parker, K., Pieler, M., Purohit, S., Songz, T., Phil, W., and Weinbach, S. GPT-NeoX : Large scale autoregressive language modeling in PyTorch , 8 2021. URL https://www.github.com/eleutherai/gpt-neox

  3. [3]

    H., Sanh, V., Yong, Z.-X., Webson, A., Raffel, C., Nayak, N

    Bach, S. H., Sanh, V., Yong, Z.-X., Webson, A., Raffel, C., Nayak, N. V., Sharma, A., Kim, T., Bari, M. S., Fevry, T., Alyafeai, Z., Dey, M., Santilli, A., Sun, Z., Ben-David, S., Xu, C., Chhablani, G., Wang, H., Fries, J. A., Al-shaibani, M. S., Sharma, S., Thakker, U., Almubarak, K., Tang, X., Tang, X., Jiang, M. T.-J., and Rush, A. M. Promptsource: An ...

  4. [6]

    S., Sutawika, L., Purohit, S., Schoelkopf, H., Anthony, Q., and Raff, E

    Biderman, S., Prashanth, U. S., Sutawika, L., Purohit, S., Schoelkopf, H., Anthony, Q., and Raff, E. Emergent and predictable memorization in large language models. Preprint under review, 2023

  5. [8]

    GPT-Neo : Large scale autoregressive language modeling with Mesh-TensorFlow

    Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. GPT-Neo : Large scale autoregressive language modeling with Mesh-TensorFlow . GitHub, 2021. URL https://www.github.com/eleutherai/gpt-neo

  6. [9]

    GPT-NeoX-20B : An open-source autoregressive language model

    Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. GPT-NeoX-20B : An open-source autoregressive language model. In Proceedings of BigScience Episode \#5--Workshop on Challenges & Perspectives in Creating Large Language Models, pp.\ 95--136, 2022

  7. [11]

    and Bowman, S

    Bordia, S. and Bowman, S. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp.\ 7--15, 2019

  8. [12]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

  9. [13]

    Understanding the origins of bias in word embeddings

    Brunet, M.-E., Alkalay-Houlihan, C., Anderson, A., and Zemel, R. Understanding the origins of bias in word embeddings. In International conference on machine learning, pp.\ 803--811. PMLR, 2019

  10. [14]

    The secret sharer: Evaluating and testing unintended memorization in neural networks

    Carlini, N., Liu, C., Erlingsson, \'U ., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp.\ 267--284, 2019

  11. [15]

    Extracting training data from large language models

    Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp.\ 2633--2650, 2021

  12. [17]

    Language id in the wild: Unexpected challenges on the path to a thousand-language web text corpus

    Caswell, I., Breiner, T., van Esch, D., and Bapna, A. Language id in the wild: Unexpected challenges on the path to a thousand-language web text corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pp.\ 6588--6608, 2020

  13. [19]

    Choenni, R., Shutova, E., and van Rooij, R. Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1477--1491, 2021

  14. [24]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Dodge, J., Sap, M., Marasovi \'c , A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1286--1305, 2021

  15. [25]

    D., et al

    D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M. D., et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 2020

  16. [28]

    On the sizes of openai api models

    Gao, L. On the sizes of openai api models. EleutherAI Blog, 2021

  17. [33]

    Debiasing pre-trained language models via efficient fine-tuning

    Gira, M., Zhang, R., and Lee, K. Debiasing pre-trained language models via efficient fine-tuning. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, pp.\ 59--69, 2022

  18. [39]

    Quantifying societal bias amplification in image captioning

    Hirota, Y., Nakashima, Y., and Garcia, N. Quantifying societal bias amplification in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13450--13459, 2022

  19. [41]

    S., and Zhang, X

    Hu, H., Salcic, Z., Sun, L., Dobbie, G., Yu, P. S., and Zhang, X. Membership inference attacks on machine learning: A survey. ACM Computing Surveys (CSUR), 54 0 (11s): 0 1--37, 2022

  20. [45]

    S., Subramani, N., Johnson, I., et al

    Jernite, Y., Nguyen, H., Biderman, S., Rogers, A., Masoud, M., Danchev, V., Tan, S., Luccioni, A. S., Subramani, N., Johnson, I., et al. Data governance in the age of large-scale data-driven language technology. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.\ 2206--2222, 2022

  21. [46]

    S., and Zettlemoyer, L

    Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601--1611, 2017

  22. [47]

    Highly accurate protein structure prediction with alphafold

    Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Z \' i dek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. Nature, 596 0 (7873): 0 583--589, 2021

  23. [50]

    D., Potts, C., Ré, C., and Liang, P

    Karamcheti, S., Orr, L., Bolton, J., Zhang, T., Goel, K., Narayan, A., Bommasani, R., Narayanan, D., Hashimoto, T., Jurafsky, D., Manning, C. D., Potts, C., Ré, C., and Liang, P. Mistral - a journey towards reproducible language model training, 2021. URL https://github.com/stanford-crfm/mistral

  24. [53]

    V., Scao, T

    Lauren c on, H., Saulnier, L., Wang, T., Akiki, C., del Moral, A. V., Scao, T. L., Werra, L. V., Mou, C., Ponferrada, E. G., Nguyen, H., Frohberg, J., S a s ko, M., Lhoest, Q., McMillan-Major, A., Dupont, G., Biderman, S., Rogers, A., allal, L. B., Toni, F. D., Pistilli, G., Nguyen, O., Nikpoor, S., Masoud, M., Colombo, P., de la Rosa, J., Villegas, P., T...

  25. [54]

    S., Biderman, S., Elsahar, H., Phang, J., Press, O., et al

    Le Scao, T., Wang, T., Hesslow, D., Saulnier, L., Bekman, S., Bari, M. S., Biderman, S., Elsahar, H., Phang, J., Press, O., et al. What language model to train if you have one million GPU hours? In Proceedings of BigScience Episode \#5--Workshop on Challenges & Perspectives in Creating Large Language Models, 2022

  26. [55]

    Deduplicating training data makes language models better

    Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. In Annual Meeting of the Association for Computational Linguistics, 2021

  27. [56]

    Collecting a large-scale gender bias dataset for coreference resolution and machine translation

    Levy, S., Lazar, K., and Stanovsky, G. Collecting a large-scale gender bias dataset for coreference resolution and machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp.\ 2470--2480, 2021

  28. [57]

    Jurassic-1 : Technical details and evaluation

    Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1 : Technical details and evaluation. White Paper. AI21 Labs, 2021

  29. [58]

    When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories

    Mallen, A., Asai, A., Zhong, V., Das, R., Hajishirzi, H., and Khashabi, D. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

  30. [68]

    Scaling effect of self-supervised speech models

    Pu, J., Yang, Y., Li, R., Elibol, O., and Droppo, J. Scaling effect of self-supervised speech models. Proc. Interspeech 2021, pp.\ 1084--1088, 2021

  31. [69]

    Language models are unsupervised multitask learners

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 2019. URL https://openai.com/blog/better-language-models/

  32. [70]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 0 (140): 0 1--67, 01 2020. ISSN 1532-4435. URL http://jmlr.org/papers/v21/20-074.html

  33. [74]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10684--10695, 2022

  34. [79]

    On the effect of pretraining corpora on in-context learning by a large-scale language model

    Shin, S., Lee, S.-W., Ahn, H., Kim, S., Kim, H., Kim, B., Cho, K., Lee, G., Park, W., Ha, J.-W., and Sung, N. On the effect of pretraining corpora on in-context learning by a large-scale language model. 2022

  35. [81]

    Cross-loss influence functions to explain deep network representations

    Silva, A., Chopra, R., and Gombolay, M. Cross-loss influence functions to explain deep network representations. In International Conference on Artificial Intelligence and Statistics, pp.\ 1--17. PMLR, 2022

  36. [84]

    Su \'a rez, P. J. O., Sagot, B., and Romary, L. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut f \"u r Deutsche Sprache, 2019

  37. [85]

    WuDao : Pretrain the world

    Tang, J. WuDao : Pretrain the world. Keynote adress at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2021

  38. [86]

    Tirumala, K. N. B., Markosyan, A. H., Zettlemoyer, L., and Aghajanyan, A. Memorization without overfitting: Analyzing the training dynamics of large language models. ArXiv, abs/2205.10770, 2022

  39. [88]

    The birth of bias: A case study on the evolution of gender bias in an english language model

    Van der Wal, O., Jumelet, J., Schulz, K., and Zuidema, W. The birth of bias: A case study on the evolution of gender bias in an english language model. In Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp.\ 75--75, 2022 b

  40. [89]

    and Komatsuzaki, A

    Wang, B. and Komatsuzaki, A. GPT-J-6B : A 6 billion parameter autoregressive language model, 2021

  41. [91]

    V., Pasunuru, R., Chen, D., Zettlemoyer, L., and Stoyanov, V

    Xia, M., Artetxe, M., Zhou, C., Lin, X. V., Pasunuru, R., Chen, D., Zettlemoyer, L., and Stoyanov, V. Training trajectories of language models across scales, 2022. URL https://arxiv.org/abs/2212.09803

  42. [93]

    and Lee, H

    Yoon, S. and Lee, H. Which model is helpful in solving privacy, memorization, and bias problems? 2021. URL https://soyoung97.github.io/profile/assets/papers/CS774.pdf

  43. [95]

    Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems, 32, 2019

  44. [98]

    Gender bias in coreference resolution: Evaluation and debiasing methods

    Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.\ 15--20, 2018

  45. [99]

    OpenAI Blog , year=

    Language Models are Unsupervised Multitask Learners , author=. OpenAI Blog , year=

  46. [100]

    Journal of Machine Learning Research , issn=

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. Journal of Machine Learning Research , issn=. 2020 , month=

  47. [101]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , editor=

    Unsupervised Cross-lingual Representation Learning at Scale , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , editor=. 2020 , month=. doi:10.18653/v1/2020.acl-main.747 , pages=

  48. [102]

    Computing Research Repository , eprint=

    Lifting the Curse of Multilinguality by Pre-training Modular Transformers , author=. Computing Research Repository , eprint=. 2022 , month=. doi:10.48550/arXiv.2205.06266 , url=

  49. [103]

    EleutherAI Blog , year=

    On the sizes of OpenAI API models , author=. EleutherAI Blog , year=

  50. [104]

    Computing Research Repository , eprint=

    Scaling Laws and Interpretability of Learning from Repeated Data , author=. Computing Research Repository , eprint=. 2022 , month=. doi:10.48550/arXiv.2205.10487 , url=

  51. [105]

    Advances in Neural Information Processing Systems , pages =
