Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pith reviewed 2026-05-15 17:39 UTC · model grok-4.3
The pith
A suite of 16 language models, spanning 70M to 12B parameters and all trained on identical public data in identical order, enables direct tracking of how abilities emerge during training and across scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics.
What carries the argument
The Pythia suite itself: identically ordered training runs across scales, with full checkpoint releases and data-loader reconstruction tools that allow side-by-side comparison of training trajectories.
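The checkpoint release can be made concrete. The 154-per-model count comes from the abstract; the exact schedule sketched below (step0, ten log-spaced early steps, then every 1000 steps through step143000) is assumed from Pythia's documented release, as is the `GPTNeoXForCausalLM` revision-loading pattern shown in the comment.

```python
def pythia_checkpoint_revisions():
    """Enumerate the 154 public checkpoint revisions per Pythia model.

    Assumed schedule, consistent with the released count: step0,
    ten log-spaced early checkpoints (step1 .. step512), then every
    1000 steps from step1000 to step143000.
    """
    steps = [0] + [2 ** i for i in range(10)] + list(range(1000, 144000, 1000))
    return [f"step{s}" for s in steps]


# Loading a specific checkpoint (sketch; requires the `transformers`
# library and network access, so it is left as a comment here):
#
# from transformers import GPTNeoXForCausalLM, AutoTokenizer
# model = GPTNeoXForCausalLM.from_pretrained(
#     "EleutherAI/pythia-70m", revision="step3000"
# )
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

revisions = pythia_checkpoint_revisions()
print(len(revisions), revisions[0], revisions[-1])  # 154 step0 step143000
```

Because every model in the suite shares this revision list, the same loop over `pythia_checkpoint_revisions()` yields aligned training trajectories across all 16 models.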
If this is right
- Memorization can be measured as a function of training step and model size under fixed data conditions.
- Effects of term frequency on few-shot accuracy become isolatable across the full range of scales.
- Targeted interventions for reducing gender bias can be tested while holding data order and content constant.
- Scaling trends in capability emergence can be examined with finer temporal resolution than single-final-checkpoint studies allow.
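The first point, memorization as a function of step and size, can be sketched as a pure function. The prompt/continuation lengths and the greedy-match criterion below are one common definition of extractable memorization, assumed here for illustration; `generate_fn` stands in for any model's greedy decoder.

```python
def is_memorized(tokens, generate_fn, prompt_len=32, cont_len=32):
    """Check (prompt_len, cont_len)-memorization of one training sequence.

    `tokens` is the token-id sequence from the training data;
    `generate_fn(prompt, n)` should return the model's greedy
    continuation of `prompt` as a list of n token ids.
    """
    prompt = tokens[:prompt_len]
    true_cont = tokens[prompt_len:prompt_len + cont_len]
    return generate_fn(prompt, cont_len) == true_cont


def memorization_rate(sequences, generate_fn, **kw):
    """Fraction of training sequences the model reproduces verbatim."""
    hits = sum(is_memorized(seq, generate_fn, **kw) for seq in sequences)
    return hits / len(sequences)


# Toy stand-in for a model: "memorizes" by echoing a stored corpus.
corpus = [list(range(64)), list(range(100, 164))]
lookup = {tuple(seq[:32]): seq[32:] for seq in corpus}

def toy_generate(prompt, n):
    return lookup.get(tuple(prompt), [0] * n)[:n]

print(memorization_rate(corpus + [list(range(200, 264))], toy_generate))
```

Evaluating `memorization_rate` at each released checkpoint, for each model size, yields exactly the step-by-size memorization surface the fixed-data-order setup makes measurable.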
Where Pith is reading between the lines
- The controlled setup could serve as a reference point for testing whether observed training dynamics hold under different data orders or sources.
- Researchers might now run targeted ablations on individual training stages by restarting from specific checkpoints.
- The public data loaders open the possibility of quantifying how much variance in LLM behavior is attributable to data sequence versus model size alone.
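The data-sequence point rests on simple bookkeeping: with a fixed order and one batch per optimizer step, the checkpoint at step s has seen a deterministic prefix of the data. The 1024-sequence by 2048-token batch geometry below is Pythia's reported configuration, assumed here; the flat memory-map layout in the comment is an illustrative assumption, not the released format.

```python
def sequences_seen(step, batch_size=1024):
    """Index range [0, n) of training sequences consumed by `step`
    under a fixed, deterministic data order (one batch per step)."""
    return step * batch_size


def tokens_seen(step, batch_size=1024, seq_len=2048):
    """Total training tokens consumed by `step`, assuming Pythia's
    reported 1024 x 2048 batch geometry."""
    return sequences_seen(step, batch_size) * seq_len


# With a reconstructed dataloader, sequence i of a flat token array
# would occupy tokens [i*seq_len, (i+1)*seq_len) -- a layout
# assumption for illustration only.
print(tokens_seen(143000))  # ~300B tokens over the full run
```

This determinism is what lets two checkpoints, from the same or different model sizes, be compared at exactly the same data prefix.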
Load-bearing premise
That training all models on the exact same public data in identical order, combined with released checkpoints, will produce reproducible and generalizable insights into training dynamics without major unaccounted confounding from data selection or implementation details.
What would settle it
Whether independent labs using the released checkpoints and data loaders can reproduce the memorization and bias case studies when varying only random seeds or minor implementation details: consistent results would support the premise, inconsistent ones would undermine it.
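A minimal sketch of that replication test, assuming each rerun reports a metric curve keyed by checkpoint step; the tolerance `tol` is a hypothetical parameter encoding how much seed-level noise one is willing to accept.

```python
def max_curve_gap(run_a, run_b):
    """Largest absolute metric difference between two training runs,
    compared at the checkpoint steps they share."""
    shared = sorted(set(run_a) & set(run_b))
    if not shared:
        raise ValueError("runs share no checkpoint steps")
    return max(abs(run_a[s] - run_b[s]) for s in shared)


def consistent(run_a, run_b, tol):
    """Do two independent reruns count as replicating each other,
    up to a chosen noise tolerance `tol`?"""
    return max_curve_gap(run_a, run_b) <= tol


# Two hypothetical reruns of a memorization-rate curve:
a = {1000: 0.12, 2000: 0.18, 4000: 0.25}
b = {1000: 0.13, 2000: 0.17, 4000: 0.24}
print(max_curve_gap(a, b), consistent(a, b, tol=0.02))
```

The substantive question is then empirical: whether real seed-varied reruns stay inside any defensible tolerance.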
Original abstract
How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Pythia, a suite of 16 LLMs (70M–12B parameters) trained on identical public data in fixed order, releasing 154 checkpoints per model plus dataloader reconstruction tools. Case studies illustrate applications to memorization, term-frequency effects on few-shot performance, and gender-bias reduction, arguing that the controlled setup yields novel insights into training dynamics and scaling.
Significance. If the artifacts match the description, the contribution is significant: a public, reproducible resource that removes data-order confounding for studies of LLM training. The explicit release of models, checkpoints, training code, and dataloader tools directly supports reproducible research on emergence and scaling, a clear strength for the field.
Minor comments (3)
- Abstract: the phrase 'novel results' on memorization and bias would be strengthened by a single quantitative comparison (e.g., the delta in memorization rate versus prior work) rather than a qualitative assertion.
- Section 3: the exact tokenization and packing details for the shared dataloader should be stated explicitly so that independent re-implementations can match the released checkpoints without ambiguity.
- Figure 2 and associated text: axis labels and legend entries are too small for print; increasing the font size would improve readability of the scaling curves.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the Pythia suite and for recognizing its significance for reproducible research on training dynamics. No major concerns were raised in the report; the minor comments on quantitative comparisons, dataloader specification, and figure legibility are noted.
Circularity Check
No significant circularity; contribution is artifact release
full rationale
The paper's core claim is the introduction and public release of 16 models (70M–12B) trained on identical public data in fixed order, with 154 checkpoints each and dataloader reconstruction tools. No derivation chain, equations, or first-principles predictions are present; the work consists of controlled empirical artifacts and case-study demonstrations. These do not reduce to fitted parameters, self-citations, or ansatzes by construction. The setup is self-contained against external benchmarks via public data and code, yielding a circularity score of 0.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large language models trained on the same public data in identical order will produce comparable and analyzable training trajectories across scales.
Forward citations
Cited by 21 Pith papers
- Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
  Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
- Theoretical Limits of Language Model Alignment
  The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
- Layer Collapse in Diffusion Language Models
  Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
- Layer Collapse in Diffusion Language Models
  Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
- DDO-RM: Distribution-Level Policy Improvement after Reward Learning
  DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- QLoRA: Efficient Finetuning of Quantized LLMs
  QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
- Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure
  Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
- Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
  A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner
  A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.
- Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
  RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
- Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs
  Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in ex...
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- MiniLLM: On-Policy Distillation of Large Language Models
  MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
  UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
- Diversity in Large Language Models under Supervised Fine-Tuning
  Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- A Survey of Large Language Models
  This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.