Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pith reviewed 2026-05-15 17:39 UTC · model grok-4.3
The pith
A suite of 16 language models, spanning 70M to 12B parameters and all trained on identical public data in identical order, enables direct tracking of how abilities emerge during training and across scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics.
What carries the argument
The Pythia suite itself: identically ordered training runs across scales, with full checkpoint releases and data-loader reconstruction tools that allow side-by-side comparison of training trajectories.
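The checkpoint release can be made concrete. The 154-per-model count comes from the abstract; the exact schedule sketched below (step0, ten log-spaced early steps, then every 1000 steps through step143000) is assumed from Pythia's documented release, as is the `GPTNeoXForCausalLM` revision-loading pattern shown in the comment.

```python
def pythia_checkpoint_revisions():
    """Enumerate the 154 public checkpoint revisions per Pythia model.

    Assumed schedule, consistent with the released count: step0,
    ten log-spaced early checkpoints (step1 .. step512), then every
    1000 steps from step1000 to step143000.
    """
    steps = [0] + [2 ** i for i in range(10)] + list(range(1000, 144000, 1000))
    return [f"step{s}" for s in steps]


# Loading a specific checkpoint (sketch; requires the `transformers`
# library and network access, so it is left as a comment here):
#
# from transformers import GPTNeoXForCausalLM, AutoTokenizer
# model = GPTNeoXForCausalLM.from_pretrained(
#     "EleutherAI/pythia-70m", revision="step3000"
# )
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

revisions = pythia_checkpoint_revisions()
print(len(revisions), revisions[0], revisions[-1])  # 154 step0 step143000
```

Because every model in the suite shares this revision list, the same loop over `pythia_checkpoint_revisions()` yields aligned training trajectories across all 16 models.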
If this is right
- Memorization can be measured as a function of training step and model size under fixed data conditions.
- Effects of term frequency on few-shot accuracy become isolatable across the full range of scales.
- Targeted interventions for reducing gender bias can be tested while holding data order and content constant.
- Scaling trends in capability emergence can be examined with finer temporal resolution than single-final-checkpoint studies allow.
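The first point, memorization as a function of step and size, can be sketched as a pure function. The prompt/continuation lengths and the greedy-match criterion below are one common definition of extractable memorization, assumed here for illustration; `generate_fn` stands in for any model's greedy decoder.

```python
def is_memorized(tokens, generate_fn, prompt_len=32, cont_len=32):
    """Check (prompt_len, cont_len)-memorization of one training sequence.

    `tokens` is the token-id sequence from the training data;
    `generate_fn(prompt, n)` should return the model's greedy
    continuation of `prompt` as a list of n token ids.
    """
    prompt = tokens[:prompt_len]
    true_cont = tokens[prompt_len:prompt_len + cont_len]
    return generate_fn(prompt, cont_len) == true_cont


def memorization_rate(sequences, generate_fn, **kw):
    """Fraction of training sequences the model reproduces verbatim."""
    hits = sum(is_memorized(seq, generate_fn, **kw) for seq in sequences)
    return hits / len(sequences)


# Toy stand-in for a model: "memorizes" by echoing a stored corpus.
corpus = [list(range(64)), list(range(100, 164))]
lookup = {tuple(seq[:32]): seq[32:] for seq in corpus}

def toy_generate(prompt, n):
    return lookup.get(tuple(prompt), [0] * n)[:n]

print(memorization_rate(corpus + [list(range(200, 264))], toy_generate))
```

Evaluating `memorization_rate` at each released checkpoint, for each model size, yields exactly the step-by-size memorization surface the fixed-data-order setup makes measurable.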
Where Pith is reading between the lines
- The controlled setup could serve as a reference point for testing whether observed training dynamics hold under different data orders or sources.
- Researchers might now run targeted ablations on individual training stages by restarting from specific checkpoints.
- The public data loaders open the possibility of quantifying how much variance in LLM behavior is attributable to data sequence versus model size alone.
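The data-sequence point rests on simple bookkeeping: with a fixed order and one batch per optimizer step, the checkpoint at step s has seen a deterministic prefix of the data. The 1024-sequence by 2048-token batch geometry below is Pythia's reported configuration, assumed here; the flat memory-map layout in the comment is an illustrative assumption, not the released format.

```python
def sequences_seen(step, batch_size=1024):
    """Index range [0, n) of training sequences consumed by `step`
    under a fixed, deterministic data order (one batch per step)."""
    return step * batch_size


def tokens_seen(step, batch_size=1024, seq_len=2048):
    """Total training tokens consumed by `step`, assuming Pythia's
    reported 1024 x 2048 batch geometry."""
    return sequences_seen(step, batch_size) * seq_len


# With a reconstructed dataloader, sequence i of a flat token array
# would occupy tokens [i*seq_len, (i+1)*seq_len) -- a layout
# assumption for illustration only.
print(tokens_seen(143000))  # ~300B tokens over the full run
```

This determinism is what lets two checkpoints, from the same or different model sizes, be compared at exactly the same data prefix.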
Load-bearing premise
That training all models on the exact same public data in identical order, combined with released checkpoints, will produce reproducible and generalizable insights into training dynamics without major unaccounted confounding from data selection or implementation details.
What would settle it
Whether independent labs using the released checkpoints and data loaders can reproduce the memorization and bias case studies when varying only random seeds or minor implementation details: consistent results would support the premise, inconsistent ones would undermine it.
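A minimal sketch of that replication test, assuming each rerun reports a metric curve keyed by checkpoint step; the tolerance `tol` is a hypothetical parameter encoding how much seed-level noise one is willing to accept.

```python
def max_curve_gap(run_a, run_b):
    """Largest absolute metric difference between two training runs,
    compared at the checkpoint steps they share."""
    shared = sorted(set(run_a) & set(run_b))
    if not shared:
        raise ValueError("runs share no checkpoint steps")
    return max(abs(run_a[s] - run_b[s]) for s in shared)


def consistent(run_a, run_b, tol):
    """Do two independent reruns count as replicating each other,
    up to a chosen noise tolerance `tol`?"""
    return max_curve_gap(run_a, run_b) <= tol


# Two hypothetical reruns of a memorization-rate curve:
a = {1000: 0.12, 2000: 0.18, 4000: 0.25}
b = {1000: 0.13, 2000: 0.17, 4000: 0.24}
print(max_curve_gap(a, b), consistent(a, b, tol=0.02))
```

The substantive question is then empirical: whether real seed-varied reruns stay inside any defensible tolerance.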
Original abstract
How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Pythia, a suite of 16 LLMs (70M–12B parameters) trained on identical public data in fixed order, releasing 154 checkpoints per model plus dataloader reconstruction tools. Case studies illustrate applications to memorization, term-frequency effects on few-shot performance, and gender-bias reduction, arguing that the controlled setup yields novel insights into training dynamics and scaling.
Significance. If the artifacts match the description, the contribution is significant: a public, reproducible resource that removes data-order confounding for studies of LLM training. The explicit release of models, checkpoints, training code, and dataloader tools directly supports reproducible research on emergence and scaling, a clear strength for the field.
Minor comments (3)
- Abstract: the phrase 'novel results' on memorization and bias would be strengthened by a single quantitative comparison (e.g., the delta in memorization rate versus prior work) rather than a qualitative assertion.
- Section 3: the exact tokenization and packing details for the shared dataloader should be stated explicitly so that independent re-implementations can match the released checkpoints without ambiguity.
- Figure 2 and associated text: axis labels and legend entries are too small for print; increasing the font size would improve readability of the scaling curves.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the Pythia suite and for recognizing its significance for reproducible research on training dynamics. No major concerns were raised in the report; the minor comments on quantitative comparisons, dataloader specification, and figure legibility are noted.
Circularity Check
No significant circularity; contribution is artifact release
full rationale
The paper's core claim is the introduction and public release of 16 models (70M–12B) trained on identical public data in fixed order, with 154 checkpoints each and dataloader reconstruction tools. No derivation chain, equations, or first-principles predictions are present; the work consists of controlled empirical artifacts and case-study demonstrations. These do not reduce to fitted parameters, self-citations, or ansatzes by construction. The setup is self-contained against external benchmarks via public data and code, yielding a circularity score of 0.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large language models trained on the same public data in identical order will produce comparable and analyzable training trajectories across scales.
Forward citations
Cited by 21 Pith papers
- Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
  Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
- Theoretical Limits of Language Model Alignment
  The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
- Layer Collapse in Diffusion Language Models
  Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
- Layer Collapse in Diffusion Language Models
  Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
- DDO-RM: Distribution-Level Policy Improvement after Reward Learning
  DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- QLoRA: Efficient Finetuning of Quantized LLMs
  QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
- Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure
  Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
- Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
  A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner
  A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.
- Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
  RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
- Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs
  Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in ex...
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- MiniLLM: On-Policy Distillation of Large Language Models
  MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
  UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
- Diversity in Large Language Models under Supervised Fine-Tuning
  Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- A Survey of Large Language Models
  This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.