Textbooks Are All You Need
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 04:37 UTC · model grok-4.3
The pith
A 1.3 billion parameter code model trained solely on textbook-quality data reaches 50.6 percent pass@1 on HumanEval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phi-1 is a 1.3B-parameter Transformer trained for four days on eight A100 GPUs using six billion tokens of textbook-quality data filtered from the web plus one billion tokens of GPT-3.5-generated synthetic textbooks and coding exercises. It records 50.6 percent pass@1 on HumanEval and 55.5 percent on MBPP, and displays emergent properties absent from both phi-1-base (the model before the fine-tuning stage) and phi-1-small (a 350M-parameter model trained with the identical pipeline).
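As a rough plausibility check on the compute claim, the standard 6ND FLOPs-per-token approximation lands in the right regime; the pass count over the corpus and the GPU utilization below are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope training-compute estimate (illustrative assumptions marked).
params = 1.3e9            # phi-1 parameter count (from the paper)
tokens_per_pass = 7e9     # 6B filtered web + 1B synthetic tokens (from the paper)
passes = 8                # assumed number of passes over the corpus (illustrative)
train_flops = 6 * params * tokens_per_pass * passes  # standard 6*N*D approximation

a100_bf16_peak = 312e12   # A100 peak BF16 throughput, FLOP/s
gpus, mfu = 8, 0.4        # 8 GPUs; assumed model FLOPs utilization (illustrative)
days = train_flops / (gpus * a100_bf16_peak * mfu) / 86400
print(f"~{days:.1f} days of training")  # on the order of the reported 4 days
```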
What carries the argument
The curated training distribution that combines web-selected textbook-quality tokens with GPT-3.5-synthesized textbooks and exercises, which together serve as the complete pretraining and fine-tuning corpus.
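For context, the sketch below shows what a classifier-based quality filter over raw code files can look like; the scoring function and threshold are hypothetical stand-ins, not the paper's actual filtering pipeline.

```python
# Minimal sketch: keep only files that a (hypothetical) educational-value scorer rates highly.
from typing import Callable, Iterable, Iterator

def filter_textbook_quality(
    files: Iterable[str],
    score_fn: Callable[[str], float],  # assumed to map source text to a quality score in [0, 1]
    threshold: float = 0.7,            # illustrative cutoff
) -> Iterator[str]:
    """Yield only files whose quality score clears the threshold."""
    for text in files:
        if score_fn(text) >= threshold:
            yield text

# Usage with any scorer of your choosing:
# kept = list(filter_textbook_quality(raw_files, score_fn=my_quality_classifier))
```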
If this is right
- Code models can reach competitive benchmark scores with far fewer parameters and far less total compute than current scaling trends assume.
- A dedicated fine-tuning stage on synthetic coding exercises unlocks abilities not visible after the initial textbook pretraining.
- Training runs for capable code models can complete in days on a single node of eight GPUs rather than requiring large clusters.
- High-quality synthetic data can replace much of the volume of raw web text currently used for code pretraining.
Where Pith is reading between the lines
- The same textbook-plus-synthetic recipe could be tested on non-code domains such as mathematics or basic reasoning to see whether quality-focused data reduces the need for scale more generally.
- If the performance edge comes mainly from the synthetic portion, smaller models may become dependent on access to larger models for data generation.
- Independent verification that the synthetic exercises contain no benchmark contamination would make the quality-over-quantity claim easier to isolate and replicate.
Load-bearing premise
The filtered textbook subset and the GPT-3.5 synthetic exercises form a superior training distribution that produces the reported gains without hidden contamination or bias.
What would settle it
A new coding benchmark assembled entirely after the data collection date, or an independent audit of overlap between the synthetic exercises and the HumanEval or MBPP test cases, would show whether the performance depends on data quality alone.
read the original abstract
We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces phi-1, a 1.3B-parameter Transformer model for code, trained for 4 days on 8 A100s using 6B tokens of web 'textbook quality' data and 1B tokens of GPT-3.5-generated synthetic textbooks and exercises. It reports pass@1 accuracies of 50.6% on HumanEval and 55.5% on MBPP, along with emergent properties in comparison to phi-1-base and the 350M phi-1-small model.
Significance. Should the results prove robust to data contamination concerns, the work would highlight the potential of high-quality, curated and synthetic data to achieve strong performance with small-scale training, offering a path to more efficient development of specialized language models for code. The explicit reporting of training compute and internal model variants provides a useful reference point for data-quality focused research.
major comments (3)
- [§3.2 (Synthetic Data Generation)] The pipeline for generating the 1B tokens of synthetic textbooks and exercises with GPT-3.5 does not report any n-gram, embedding-based, or other decontamination procedures against the HumanEval and MBPP test sets (a minimal illustration of such an overlap check follows this list). This is load-bearing for the central claim, as overlap could explain the high accuracies via memorization rather than the superiority of 'textbook quality' data.
- [Results (benchmark tables)] The reported accuracies are given as single point estimates without error bars, confidence intervals, or results from multiple training runs, which weakens the ability to claim clear emergent properties or superiority over the base and small variants.
- [§4 (Experiments)] There is no comparison to external baselines of similar size (e.g., other ~1B parameter code models trained on standard corpora), making it hard to attribute the performance gains specifically to the textbook data selection rather than other factors like the model architecture or training procedure.
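The decontamination concern in the first comment is concrete enough to sketch. The check below is an illustrative token-level n-gram overlap filter of the kind commonly used for this purpose, not the authors' pipeline; the choice of n = 13 is a conventional default, not a value from the paper.

```python
# Flag a training example if it shares any n-gram of whitespace-split tokens
# with a benchmark reference solution.
def ngrams(text: str, n: int = 13) -> set[str]:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_example: str, benchmark_solutions: list[str], n: int = 13) -> bool:
    grams = ngrams(train_example, n)
    return any(grams & ngrams(solution, n) for solution in benchmark_solutions)
```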
minor comments (2)
- [Abstract] The total number of training tokens (7B) and the distinction between pretraining and finetuning stages could be stated more explicitly for clarity.
- [Throughout] Ensure all acronyms (e.g., pass@1) are defined on first use, even though they are standard in the field; a sketch of the standard pass@k estimator follows below.
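For readers who want the acronym spelled out: pass@1 is the fraction of benchmark problems for which a sampled completion passes all unit tests. The unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is sketched below for reference.

```python
# Unbiased pass@k for one problem: n completions sampled, c of them pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k draw must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ~0.3; with k=1 this reduces to the fraction of passing samples
```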
Simulated Author's Rebuttal
Thank you for your review and the insightful comments on our work. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns.
read point-by-point responses
-
Referee: [§3.2 (Synthetic Data Generation)] The pipeline for generating the 1B tokens of synthetic textbooks and exercises with GPT-3.5 does not report any n-gram, embedding-based, or other decontamination procedures against the HumanEval and MBPP test sets. This is load-bearing for the central claim, as overlap could explain the high accuracies via memorization rather than the superiority of 'textbook quality' data.
Authors: We thank the referee for highlighting this important aspect. While the synthetic data is generated from original prompts designed to create educational content without referencing benchmark problems, we recognize the need for explicit verification. In the revised manuscript, we have added details on the decontamination process: we applied both n-gram overlap checks and embedding-based similarity searches (using sentence embeddings) to filter any potential overlap with HumanEval and MBPP test sets. The results showed negligible overlap, confirming that the performance stems from the quality of the data rather than memorization. revision: yes
-
Referee: [Results (benchmark tables)] The reported accuracies are given as single point estimates without error bars, confidence intervals, or results from multiple training runs, which weakens the ability to claim clear emergent properties or superiority over the base and small variants.
Authors: We agree that multiple runs would provide better statistical confidence. Due to the significant computational resources required for each training run of the 1.3B model, we report results from a single training run. However, we have included in the revision a discussion of the consistency observed across the model variants (phi-1-base, phi-1-small) and note that the performance differences are substantial (e.g., from 45% to 50.6%), which supports the emergence claim. We also reference similar single-run reporting in related works; an illustrative single-run interval is sketched after these responses. revision: partial
-
Referee: [§4 (Experiments)] There is no comparison to external baselines of similar size (e.g., other ~1B parameter code models trained on standard corpora), making it hard to attribute the performance gains specifically to the textbook data selection rather than other factors like the model architecture or training procedure.
Authors: Our primary goal was to demonstrate the impact of data quality through controlled internal comparisons with phi-1-base and phi-1-small, which share the same architecture and training setup but differ in data and stages. To address the request for external context, we have added comparisons in the revised manuscript to other models of similar size reported in the literature, such as CodeGen-1B and similar variants, using their published HumanEval scores. This shows phi-1 achieving higher performance with less compute, further supporting our claims about the textbook data. revision: yes
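On the error-bar exchange above: even a single run can report an approximate interval around its pass@1 point estimate. The sketch below uses a normal-approximation binomial interval over HumanEval's 164 problems; treating one-sample-per-problem pass@1 as a binomial proportion is an assumption about how such a bar could be reported, not something the paper does.

```python
# Approximate 95% interval for a pass@1 estimate over n benchmark problems.
from math import sqrt

def binomial_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

print(binomial_ci(0.506, 164))  # roughly (0.43, 0.58) around the reported HumanEval score
```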
Circularity Check
No circularity: purely empirical training and benchmark reporting
full rationale
The paper describes an experimental pipeline: selection of web textbook-quality data (6B tokens), GPT-3.5 synthetic textbook/exercise generation (1B tokens), training of a 1.3B Transformer, and direct measurement of pass@1 accuracy on the fixed public HumanEval and MBPP suites. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Reported numbers are raw inference results on external benchmarks, not quantities that reduce by construction to any internal fit or self-citation. The work contains no load-bearing self-citations, ansatzes, or uniqueness theorems that would trigger the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: next-token prediction on curated text produces useful code generation capabilities.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
"We introduce phi-1... using a selection of 'textbook quality' data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens)."
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
"N-gram overlap... embedding and syntax-based similarity analysis... pruning more than 40% of the CodeExercises dataset"
Forward citations
Cited by 28 Pith papers
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
MUSCAT: MUltilingual, SCientific ConversATion Benchmark
MUSCAT is a benchmark of bilingual scientific conversations designed to evaluate ASR systems on code-switching and domain-specific challenges.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Large Spectrum Models (LSMs): Decoder-Only Transformer-Powered Spectrum Activity Forecasting via Tokenized RF Data
Decoder-only transformers trained on tokenized RF spectrum data from 22 TB of measurements achieve 3.25 dB RMSE in spectrum activity forecasting across 33 bands.
-
Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
Fixed 16-bit binary token codes can replace trainable input embeddings in 32-layer decoder-only models while maintaining comparable held-out perplexity on 17B tokens.
-
SkillGen: Verified Inference-Time Agent Skill Synthesis
SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
-
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference
CoDe-R refines LLM decompiler output via rationale-guided semantic injection and dynamic fallback inference, making a 1.3B model the first to exceed 50% average re-executability on HumanEval-Decompile.
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Textbooks Are All You Need II: phi-1.5 technical report
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and...
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking
FinRAG-12B is a production-deployed 12B model for banking that grounds answers with citations, refuses unanswerable queries at a calibrated 12% rate, outperforms GPT-4.1 on grounding, and improves query resolution by ...
-
Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines
A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.
-
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
- OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
Reference graph
Works this paper leans on
- [1] [ADF+23] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 Technical Report. arXiv preprint arXiv:2305.10403.
- [2] [ALK+23] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. SantaCoder: Don't Reach for the Stars! arXiv preprint arXiv:2301.03988.
- [3] [AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.
- [4] [AZL23] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 1, Context-Free Grammar. arXiv preprint arXiv:2305.13673.
- [5] [BCE+23] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv preprint arXiv:2303.12712.
- [6] [BGMMS21] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
- [7] [BJT+22] Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient Training of Language Models to Fill in the Middle. arXiv preprint arXiv:2207.14255.
- [8] [BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 2020.
- [9] [CND+22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.
- [10] [CTJ+21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
- [11] [DLT+23] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv preprint arXiv:2305.14387.
- [12] [EL23] Ronen Eldan and Yuanzhi Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
- [13] [GWS+23] Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The False Promise of Imitating Proprietary LLMs. arXiv preprint arXiv:2305.15717.
- [14] [HNA+17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
- [15] [JWJ+23] Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, and Yejin Choi. Impossible Distillation: From Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing. arXiv preprint arXiv:2305.16635.
- [16] [KLA+22] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. The Stack: 3 TB of Permissively Licensed Source Code. arXiv preprint arXiv:2211.15533.
- [17] [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
- [18] [LAZ+23] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. StarCoder: May the Source Be with You! arXiv preprint arXiv:2305.06161.
- [19] [LGK+23] Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, Harsha Nori, and Sergey Yekhanin. Differentially Private Synthetic Data via Foundation Model APIs 1: Images. arXiv preprint arXiv:2305.15560.
- [20] [LXWZ23] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv preprint arXiv:2305.01210.
- [21] [LYR+23] Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, et al. A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. arXiv preprint arXiv:2305.13169.
- [22] [MMJ+23] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv preprint arXiv:2306.02707.
- [23] [MRB+23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling Data-Constrained Language Models. arXiv preprint arXiv:2305.16264.
- [24] [NHX+23] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. arXiv preprint arXiv:2305.02309.
- [25] OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. [Rep23] Replit. Replit Dev Day. https://twitter.com/Replit/status/1651344184593506304.
- [26] [SLP+21] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.
- [27] [SSZ+23] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. Model Dementia: Generated Data Makes Models Forget. arXiv preprint arXiv:2305.17493.
- [28] [WKM+22] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv preprint arXiv:2212.10560.
- [29] [WLG+23] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. arXiv preprint arXiv:2305.07922.
- [30] [YGK+23] Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz Religa, Jian Yin, and Huishuai Zhang. Selective Pre-training for Private Fine-tuning. arXiv preprint arXiv:2305.13865.