Nothing from Something: Can a Language Model Discover 0?

Brenden M. Lake; Phoebe Zeng; Thomas L. Griffiths

arxiv: 2606.17289 · v2 · pith:X6YQHLKFnew · submitted 2026-06-15 · 💻 cs.AI · cs.CL


Pith reviewed 2026-06-27 03:19 UTC · model grok-4.3


keywords: language models · zero · mathematical discovery · generalization · arithmetic · pretraining · out-of-distribution


The pith

Language models of GPT-2 size cannot discover the concept of zero without explicit training examples, though language pretraining cuts the number needed by about 50%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current language models can reach beyond their training data to discover the mathematical concept of zero on their own. It finds that these models do not generalize to arithmetic tasks involving zero when tested without any prior zero examples, even if they received language pretraining. Performance rises sharply once the models see tens or hundreds of zero examples during fine-tuning. Language pretraining lowers the number of such examples required by roughly half, indicating that language skills can help models acquire new mathematical structures with less direct data.

Core claim

Language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but models can improve substantially after training on tens or hundreds of examples of zero. Additionally, language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

What carries the argument

The zero-generalization task in simple arithmetic, which measures whether models can extend their training on non-zero numbers to include the new concept of zero.
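A minimal sketch of how such a held-out split could be built, assuming single-digit addition problems written as plain strings. The problem format, the operand range, the criterion for "involves zero", and the function name `make_split` are all illustrative assumptions, not the paper's actual setup.

```python
from itertools import product


def make_split(max_operand=9):
    """Split a+b problems into a zero-free training set and a zero-only test set.

    Assumption: a problem "involves zero" if the digit 0 appears anywhere in
    its text (either operand or the answer). The paper's criterion may differ.
    """
    train, test = [], []
    for a, b in product(range(max_operand + 1), repeat=2):
        text = f"{a}+{b}={a + b}"
        # Route every problem containing the digit 0 to the held-out test set.
        (test if "0" in text else train).append(text)
    return train, test


train, test = make_split()
```

Under these assumptions the default range gives 72 training problems and 28 held-out problems; the held-out set includes cases such as `1+9=10`, where zero appears only in the answer.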

If this is right

AI systems may need direct exposure to zero examples to acquire basic mathematical concepts.
Language pretraining can reduce the data required for learning new arithmetic structures by half.
Spontaneous mathematical discovery of zero does not occur in these models from non-zero training alone.
Mathematical generalization in neural networks depends more on explicit examples than on pretraining alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Larger models or different objectives might still fail to invent zero without targeted examples.
This pattern could apply to other missing concepts like negative numbers or fractions.
Historical human invention of zero might parallel the need for cultural or explicit introduction rather than pure inference.

Load-bearing premise

That failure to handle zero at test time without any zero examples proves an inability to discover the concept rather than a limit of the chosen training regime or task setup.

What would settle it

A GPT-2-size model that correctly answers arithmetic questions involving zero after training only on positive numbers or non-zero operations would falsify the main claim.

Figures

Figures reproduced from arXiv: 2606.17289 by Brenden M. Lake, Phoebe Zeng, Thomas L. Griffiths.

**Figure 2.** Language pretraining curves and model perplexity on arithmetic train data after language pretraining. The …

**Figure 3.** Comparison of model generalization to zero at test time, across training regimes. The training and validation …

**Figure 4.** Model generalization to zero at test time in few …

**Figure 5.** Final test accuracy on holdout digits 0-9. Zero …

**Figure 7.** Number of digits with cosine similarity ≥ 0.65 with holdout digits 0-9. Digits that fall in the middle of the range have more “near neighbors”.

**Figure 9.** (Open sourced) model generalization to zero at test …

**Figure 10.** (Open sourced) model generalization to zero at test …

**Figure 13.** (Open sourced) model generalization to zero at test …

**Figure 11.** Model generalization to zero at test time with answer …

**Figure 17.** Model generalization to digits 0-9 at test time with …

**Figure 18.** Model generalization to digits 0-7 at test time in the …

**Figure 16.** Model generalization to digits 0-9 at test time with …

**Figure 19.** Model generalization to digits 0-7 at test time in the …

**Figure 22.** Model generalization to digits 0-7 at test time in the …

**Figure 23.** Model generalization to digits 0-7 at test time in the …

**Figure 21.** Model generalization to digits 0-9 at test time.
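The neighbor count described in the Figure 7 caption can be sketched as follows, assuming each digit has some vector representation (for example a token embedding). Which vectors the paper actually compares is not stated here, and the toy embeddings below are purely illustrative; only the 0.65 threshold comes from the caption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def near_neighbor_counts(embeddings, threshold=0.65):
    """For each digit, count the other digits with cosine similarity >= threshold."""
    return {
        d: sum(1 for e, v in embeddings.items() if e != d and cosine(u, v) >= threshold)
        for d, u in embeddings.items()
    }


# Toy 2-D embeddings spaced along an arc (hypothetical, not from the paper).
toy = {d: (math.cos(0.3 * d), math.sin(0.3 * d)) for d in range(10)}
counts = near_neighbor_counts(toy)
```

In this toy layout, digits at the ends of the range get fewer near neighbors than digits in the middle, which is the qualitative pattern the caption describes.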

Original abstract

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new - and potentially logically more powerful - mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately $50\%$, showing that language abilities can scaffold mathematical discovery in neural models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Language pretraining cuts examples needed for zero by half, but test-time failure alone does not securely show the models cannot discover the concept.


The paper's clearest result is that language pretraining reduces the number of zero examples required by roughly 50 percent in their arithmetic setup. Models still fail the generalization at test time without any zero examples, but improve after seeing tens or hundreds of them. That quantified pretraining effect is the new piece.

The work tests a straightforward hypothesis about whether language abilities can scaffold mathematical concept learning in neural nets. It uses a simple arithmetic case study and reports a concrete difference between pretrained and non-pretrained models. That kind of targeted comparison is useful for people tracking how pretraining affects out-of-distribution performance on abstract tasks.

The soft spot is the leap from test-time failure to the claim that the models cannot discover zero. The abstract treats the failure as evidence of inability to discover the concept, but does not show that the chosen task and prompting would surface an internal representation of zero if one existed. Without negative controls or alternative elicitation methods that isolate the presence of the concept, the result could reflect limits of the regime rather than limits of the model. The stress-test note correctly flags this as load-bearing.

This is for researchers working on mathematical reasoning and few-shot concept acquisition in language models. It is not a broad theoretical advance, but the pretraining result is specific enough to be checked.

I would send it to peer review. The question is worth referee time and the 50 percent figure gives something concrete to evaluate, even if the interpretation of the test-time result needs tighter justification in the full paper.

Referee Report

2 major / 1 minor

Summary. The paper uses simple arithmetic tasks as a case study to test whether GPT-2-scale language models can discover the concept of zero via out-of-distribution generalization. It claims that (1) such models fail to perform zero generalization at test time regardless of language pretraining, (2) they improve substantially after supervised training on tens or hundreds of zero examples, and (3) language pretraining reduces the number of required examples by approximately 50%.

Significance. If the empirical results hold under rigorous controls, the finding that language pretraining can scaffold acquisition of a new mathematical primitive (zero) by halving the number of examples needed would be a concrete, falsifiable contribution to understanding how pretraining affects mathematical discovery in neural models. The work also supplies a minimal testbed for probing whether models can hypothesize logically stronger structures beyond their training distribution.

major comments (2)

[Abstract / §3] Abstract and §3 (experimental setup): the claim that test-time failure demonstrates an inability to 'discover' zero is load-bearing but rests on the untested assumption that the chosen arithmetic task and prompting regime would elicit the concept if an internal representation existed. No negative controls are described that hold all other factors fixed while varying only the presence/absence of zero, leaving open the possibility that the observed failure reflects task formulation or elicitation limits rather than conceptual absence.
[Abstract / Results] Abstract and results section: no details are supplied on experimental design, data splits, number of runs, statistical tests, or variance across random seeds. Without these, it is impossible to assess whether the reported improvement after 'tens or hundreds of examples' and the 50% reduction due to pretraining are robust or could be artifacts of particular splits or prompting choices.

minor comments (1)

[Abstract] Abstract: the phrase 'language pretraining reduces the number of required examples by approximately 50%' should be accompanied by the precise baseline (e.g., from-scratch vs. pretrained) and the exact metric (examples to reach a given accuracy threshold) used to compute the reduction.
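One way the metric requested here could be made precise is sketched below: take the smallest number of zero examples at which test accuracy reaches a fixed threshold, then compare that number between a from-scratch baseline and a pretrained model. The 0.9 threshold, the function names, and the accuracy curves are placeholders for illustration, not values or definitions from the paper.

```python
def examples_to_threshold(curve, threshold):
    """Smallest number of zero examples whose accuracy meets the threshold.

    `curve` maps number of zero examples -> test accuracy.
    Returns None if the threshold is never reached.
    """
    for n in sorted(curve):
        if curve[n] >= threshold:
            return n
    return None


def relative_reduction(baseline_n, pretrained_n):
    """Fractional reduction in required examples relative to the baseline."""
    return (baseline_n - pretrained_n) / baseline_n


# Placeholder curves for illustration only; these are NOT results from the paper.
scratch = {0: 0.0, 50: 0.4, 100: 0.7, 200: 0.95}
pretrained = {0: 0.0, 50: 0.6, 100: 0.92, 200: 0.97}

n_scratch = examples_to_threshold(scratch, 0.9)
n_pre = examples_to_threshold(pretrained, 0.9)
reduction = relative_reduction(n_scratch, n_pre)
```

Stating the baseline and threshold explicitly matters because the same pair of curves can yield different reductions at different thresholds.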

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will incorporate revisions to improve experimental rigor and reporting.


Referee: [Abstract / §3] Abstract and §3 (experimental setup): the claim that test-time failure demonstrates an inability to 'discover' zero is load-bearing but rests on the untested assumption that the chosen arithmetic task and prompting regime would elicit the concept if an internal representation existed. No negative controls are described that hold all other factors fixed while varying only the presence/absence of zero, leaving open the possibility that the observed failure reflects task formulation or elicitation limits rather than conceptual absence.

Authors: We agree that explicit negative controls would strengthen the causal interpretation of the test-time failure. In the revised manuscript we will add a new subsection in §3 describing negative-control experiments that hold the arithmetic task, prompting format, and model architecture fixed while systematically varying only the presence or absence of zero in the training distribution. These controls will be reported alongside the original results. revision: yes

Referee: [Abstract / Results] Abstract and results section: no details are supplied on experimental design, data splits, number of runs, statistical tests, or variance across random seeds. Without these, it is impossible to assess whether the reported improvement after 'tens or hundreds of examples' and the 50% reduction due to pretraining are robust or could be artifacts of particular splits or prompting choices.

Authors: We acknowledge that the original submission omitted these methodological details. The revised version will include a new 'Experimental Details' subsection that specifies: (i) the exact train/validation/test splits and how zero examples were held out, (ii) the number of independent runs (five random seeds), (iii) the statistical tests used (paired t-tests with reported p-values), and (iv) all results reported as means ± standard deviation across seeds. These additions will allow readers to evaluate robustness directly. revision: yes
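The reporting promised in this response (means ± standard deviation across seeds and a paired t-test) could look roughly like the sketch below. It uses only the Python standard library, the per-seed accuracies are placeholders rather than results from the paper, and pairing runs by seed is an assumption about how the comparison would be set up.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Placeholder accuracies for five seeds (hypothetical, not from the paper).
pretrained_acc = [0.90, 0.92, 0.88, 0.91, 0.89]
scratch_acc = [0.80, 0.85, 0.79, 0.83, 0.78]

pre_mean, pre_std = mean_std(pretrained_acc)
t_stat, dof = paired_t(pretrained_acc, scratch_acc)
```

The p-value would then be read from a t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes the statistic and p-value in one call.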

Circularity Check

0 steps flagged

No circularity: empirical study with no derivation chain


The paper is an empirical investigation of language model generalization on arithmetic tasks involving zero. It reports experimental results on GPT-2 scale models with and without pretraining, showing failure at test time without examples and improvement after fine-tuning. No mathematical derivation, equations, or theoretical chain is presented that could reduce to its inputs by construction. Claims rest on observable training outcomes and data splits, which are externally falsifiable via replication. Self-citations, if present, are not load-bearing for any central premise that would create circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is purely empirical.

pith-pipeline@v0.9.1-grok · 5700 in / 1037 out tokens · 47854 ms · 2026-06-27T03:19:27.054711+00:00 · methodology


