Textbooks Are All You Need II: phi-1.5 technical report
Pith reviewed 2026-05-14 19:13 UTC · model grok-4.3
The pith
A 1.3 billion parameter model trained on synthetic textbooks matches models five times larger on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
phi-1.5 is a 1.3 billion parameter Transformer trained primarily on synthetic textbook-quality data generated by existing large language models. It reaches performance on natural language tasks comparable to models five times its size and surpasses most non-frontier LLMs on grade-school mathematics and basic coding. The model exhibits traits of much larger systems such as thinking step by step and rudimentary in-context learning, while showing fewer toxic generations because web data was avoided.
What carries the argument
Synthetic textbook-quality data generated by larger LLMs, used in place of web-scraped text to train the smaller model.
If this is right
- Smaller models can display chain-of-thought reasoning and in-context learning when trained on clean synthetic data.
- Avoiding web data reduces the rate of toxic or biased outputs.
- Performance on math and coding tasks can improve through data quality rather than parameter count alone.
- Open-sourcing the model enables community checks on whether the observed abilities generalize beyond the training distribution.
Where Pith is reading between the lines
- Data curation methods may prove more efficient than raw scaling for building capable models on targeted reasoning tasks.
- The same synthetic-data approach could be tested on domains outside language, such as basic science or logic puzzles.
- Future evaluations should include out-of-distribution problems to check whether the model truly reasons or has memorized benchmark styles.
- Open release allows independent tests of whether the reduced toxicity persists when the model is fine-tuned on new data.
Load-bearing premise
Standard grade-school math and coding benchmarks measure genuine reasoning rather than the model simply matching patterns present in the synthetic training data.
What would settle it
A new collection of math and coding problems written to avoid stylistic patterns from the synthetic textbooks; if phi-1.5 performance drops sharply on this set while larger models do not, the central claim is weakened.
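One way to make "drops sharply while larger models do not" concrete is to compare relative accuracy losses between the original benchmark and the rewritten problem set. The sketch below is only an illustration: the function names and the 10-point `margin` are assumptions introduced here, not anything defined by the paper or this review, and any accuracies passed in would have to come from a real evaluation.

```python
from typing import Tuple


def relative_drop(original_acc: float, rewritten_acc: float) -> float:
    """Fraction of the original accuracy that is lost on the rewritten set."""
    if original_acc <= 0:
        raise ValueError("original accuracy must be positive")
    return (original_acc - rewritten_acc) / original_acc


def drops_sharply(
    small: Tuple[float, float],
    large: Tuple[float, float],
    margin: float = 0.10,  # hypothetical threshold, chosen only for illustration
) -> bool:
    """True if the small model's relative drop exceeds the large model's by more than `margin`.

    Each tuple is (accuracy on original benchmark, accuracy on rewritten set).
    """
    return relative_drop(*small) - relative_drop(*large) > margin
```

Comparing relative rather than absolute drops keeps the test from penalizing a model merely for starting at a lower baseline; the threshold itself would need to be justified against run-to-run noise.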
Original abstract
We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good -- such as the ability to "think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces phi-1.5, a 1.3B-parameter Transformer model trained primarily on synthetic 'textbook-quality' data generated by larger LLMs. Building on prior work with TinyStories and phi-1, it focuses on common-sense reasoning in natural language and reports that phi-1.5 achieves performance on natural language tasks comparable to models 5x larger while surpassing most non-frontier LLMs on grade-school mathematics and basic coding. The model is shown to exhibit step-by-step reasoning, rudimentary in-context learning, hallucinations, and reduced toxicity due to the absence of web data; the model weights are open-sourced.
Significance. If the empirical claims hold after proper validation, the result would demonstrate that high-quality synthetic data can substantially narrow the performance gap between small and large models on reasoning tasks, offering a data-centric alternative to pure scaling and potentially reducing reliance on noisy web corpora.
major comments (2)
- [Abstract] The central claim that phi-1.5 matches models 5x larger on natural language tasks and surpasses most non-frontier LLMs on grade-school math and coding is stated without any benchmark scores, model-size comparisons, baselines, or statistical details; this absence makes the headline result impossible to evaluate from the provided text.
- [Abstract] The training corpus is generated by larger LLMs; no decontamination statistics, n-gram overlap analysis, or ablation isolating synthetic versus web data on held-out problems (e.g., GSM8K) are supplied, leaving open the possibility that reported gains reflect distributional overlap rather than genuine reasoning improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that the abstract can be strengthened with concrete metrics and that additional data-quality analyses would improve transparency. We will incorporate these changes in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claim that phi-1.5 matches models 5x larger on natural language tasks and surpasses most non-frontier LLMs on grade-school math and coding is stated without any benchmark scores, model-size comparisons, baselines, or statistical details; this absence makes the headline result impossible to evaluate from the provided text.
Authors: We agree that the abstract would benefit from explicit benchmark numbers to make the central claims immediately evaluable. The full paper already contains the supporting evidence in Section 4 and Tables 1–3 (e.g., phi-1.5 at 1.3B matches or exceeds several 7B models on ARC, BoolQ, and PIQA while outperforming most non-frontier models on GSM8K and HumanEval). In the revision we will add representative scores and size comparisons directly to the abstract.
Revision: yes
-
Referee: [Abstract] The training corpus is generated by larger LLMs; no decontamination statistics, n-gram overlap analysis, or ablation isolating synthetic versus web data on held-out problems (e.g., GSM8K) are supplied, leaving open the possibility that reported gains reflect distributional overlap rather than genuine reasoning improvement.
Authors: We acknowledge this is a fair point. Although the synthetic textbook data is generated via prompted larger models with explicit instructions for diversity and step-by-step reasoning, the original submission did not report n-gram overlap or decontamination statistics against held-out sets. We will add these analyses (including 13-gram overlap with GSM8K and an ablation of synthetic-only versus mixed training) in a new subsection of the data section to rule out contamination as the source of gains.
Revision: yes
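For readers unfamiliar with the n-gram overlap analysis mentioned in the second response, a minimal Python sketch follows. It assumes whitespace tokenization, lowercasing, and an in-memory corpus; the function names are hypothetical and this is not the authors' pipeline, which the text does not describe. A real check over billions of tokens would need hashing or a streaming index rather than a single Python set.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word-level n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(
    train_docs: Iterable[str],
    test_items: List[str],
    n: int = 13,  # 13-grams, matching the window named in the rebuttal
) -> float:
    """Fraction of test items that share at least one n-gram with any training document."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)
```

Exact n-gram matching only catches near-verbatim leakage. Paraphrased or restyled copies of benchmark problems would pass this check, which is why the referee also asks for an ablation on held-out problems.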
Circularity Check
No circularity; performance claims rest on external benchmark evaluations
Full rationale
The paper is a technical report describing the training of phi-1.5 on synthetic textbook data generated by larger LLMs and reporting its empirical results on standard natural language, math, and coding benchmarks. No derivation chain, equations, or first-principles predictions are presented that reduce to fitted parameters or self-referential inputs. The central claims are benchmark scores (e.g., comparable to 5x larger models on natural language tasks, and above most non-frontier LLMs on grade-school math and coding), which are measured against external test sets rather than constructed from the paper's own definitions or prior outputs. Reference to the prior 'Textbooks Are All You Need' work is methodological context only and does not bear the load of the performance results. This is a standard empirical report with no load-bearing self-citation chains or ansatz smuggling.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear: "We follow the “Textbooks Are All You Need” approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding."
-
Foundation.DimensionForcing · dimension_forced · unclear: "Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world"
Forward citations
Cited by 20 Pith papers
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning
TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level b...
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
-
GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization
GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.
-
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...
-
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs
DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.
-
SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.