pith. machine review for the scientific record.

arxiv: 2604.24037 · v3 · submitted 2026-04-27 · 💻 cs.LG · math.ST · stat.TH

Recognition: 2 theorem links

· Lean Theorem

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:22 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.TH
keywords emergent intelligence · foundation models · scaling laws · limit theory · Lipschitz operators · performance function · model architecture · nonlinear operator theory

The pith

Emergent intelligence in foundation models arises from the existence of a parameter-limit architecture as model size, data size and training steps approach infinity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a performance function E(N, P, K) depending on data size N, model size P and training steps K. Emergent intelligence is defined as the limit of this function when all three arguments tend to infinity. The authors prove using nonlinear Lipschitz operator theory that this limit exists if and only if a parameter-limit architecture is present. They also derive scaling laws and identify the condition Lip(T)=1 as critical for emergence. Readers would care because the framework replaces empirical description with a mathematical criterion for when scaling produces new capabilities.

Core claim

Emergent intelligence is recast as the existence of the limit lim N,P,K→∞ E(N,P,K). This limit is produced by a parameter-limit architecture whose learning behavior matches the observed emergence. Nonlinear Lipschitz operator theory supplies the necessary and sufficient conditions for the architecture to exist. Scaling laws are derived via Lipschitz operators and covering numbers. The results state that emergence depends on training steps, data size and the properties of basic blocks in the architecture, with the condition Lip(T)=1 serving as the critical threshold.
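
Stated compactly in the paper's notation (a condensed restatement of the abstract and of the conditions quoted in the Lean section below, not a new result):

$$\text{emergent intelligence} \;:\Longleftrightarrow\; \lim_{N,P,K \to \infty} \mathcal{E}(N,P,K) \ \text{exists},$$

$$\text{limit architecture exists} \;\Longleftrightarrow\; \text{(i)}\ \mathrm{Lip}(T_i) \le 1 \ \text{for all } i \ge K_0 \ \ \text{and}\ \ \text{(ii)}\ \exists\, T \ \text{non-expansive with}\ \|T_i - T\| \le \epsilon_i,\ \textstyle\sum_i \epsilon_i < \infty,$$

with $\mathrm{Lip}(T) = 1$ singled out as the critical case associated with emergence.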

What carries the argument

The parameter-limit architecture: the infinite-dimensional system whose existence is necessary and sufficient for the performance limit lim N,P,K→∞ E(N,P,K) to exist, and whose learning dynamics produce emergent abilities.

If this is right

  • Emergent intelligence is governed by training steps, data size and model architecture, with the properties of basic blocks playing a decisive role.
  • The critical condition Lip(T)=1 provides theoretical support for existing empirical findings on when emergence occurs.
  • Emergent intelligence is determined by an infinite-dimensional system yet can be realized through finite-dimensional architectures.
  • Scaling laws for foundation models are obtained directly from Lipschitz operator and covering number arguments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures could be deliberately designed to satisfy Lip(T)=1 in order to promote controlled emergence.
  • Finite models may already serve as effective approximations to the limit architecture, allowing scaling benefits to continue without literal infinity.
  • The same limit framework might extend to other scaling behaviors observed in machine learning systems.

Load-bearing premise

The performance function E(N, P, K) is sufficiently well-behaved that the triple limit as N, P and K approach infinity exists, and nonlinear Lipschitz operator theory applies directly to the training dynamics without further regularity assumptions.
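
For orientation, one standard sufficient condition of this type (our illustration of the kind of regularity involved, not an assumption stated in the abstract): if $\mathcal{E}$ is bounded and nondecreasing in each argument, the triple limit exists and equals the supremum,

$$\mathcal{E} \le C \ \text{ and } \ \mathcal{E}\ \text{nondecreasing in each of } N, P, K \;\Longrightarrow\; \lim_{N,P,K \to \infty} \mathcal{E}(N,P,K) = \sup_{N,P,K} \mathcal{E}(N,P,K).$$

Without monotonicity or some comparable control, the joint limit can fail to exist even when each single-variable limit does.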

What would settle it

A scaling experiment that increases N, P and K while the observed Lip(T) value stays away from 1 yet new emergent abilities still appear, or that increases all three variables while the limit of E fails to exist and no new abilities appear.
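
A minimal sketch of how such a test could be instrumented, using a toy stand-in for the training operator and for the performance function (the helper names, the random linear operator and the synthetic performance curve below are illustrative assumptions, not the paper's setup):

    import numpy as np

    def estimate_lipschitz(step_fn, states, rng, eps=1e-3):
        """Crude local Lipschitz estimate of a training-step operator: the largest
        ratio ||step(x + d) - step(x)|| / ||d|| over small random perturbations d."""
        ratios = []
        for x in states:
            d = rng.normal(size=x.shape)
            d *= eps / (np.linalg.norm(d) + 1e-12)
            ratios.append(np.linalg.norm(step_fn(x + d) - step_fn(x)) / eps)
        return max(ratios)

    def run_scaling_point(P, K, rng):
        """Toy stand-in for one (P, K) scaling point: a random linear operator plays
        the role of the training operator T, and a saturating curve stands in for E."""
        A = rng.normal(size=(P, P)) / np.sqrt(P)     # random operator, Lip roughly O(1)
        step_fn = lambda x: A @ x
        states = [rng.normal(size=P) for _ in range(8)]
        lip = estimate_lipschitz(step_fn, states, rng)
        e_hat = 1.0 - np.exp(-K / (50.0 * P))        # placeholder performance value
        return lip, e_hat

    rng = np.random.default_rng(0)
    for P, K in [(32, 1_000), (128, 10_000), (512, 100_000)]:
        lip, e_hat = run_scaling_point(P, K, rng)
        print(f"P={P:4d}  K={K:7d}  Lip(T) estimate={lip:.2f}  E estimate={e_hat:.3f}")

In a real run, step_fn would be the model's actual training-step operator and the performance value would come from a benchmark suite; the sketch only shows the bookkeeping of tracking a Lipschitz estimate alongside E across scaling points.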

read the original abstract

Emergent intelligence have played a major role in the modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function E(N, P, K), dependent on data size N, model size P and training steps K, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the necessary and sufficient conditions for existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools of Lipschitz operator and covering number. Theoretical results show that: 1) emergent intelligence is governed by three key factors-training steps, data size and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings. 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper develops a limit theory for foundation models by introducing a performance function E(N,P,K) that quantifies intelligence behavior in terms of data size N, model size P, and training steps K. It recasts emergent intelligence as the existence of the limit lim N,P,K→∞ E(N,P,K), attributes this to the existence of a 'limit architecture' whose necessary and sufficient conditions are proved via nonlinear Lipschitz operator theory, derives scaling laws using Lipschitz operators and covering numbers, and states that basic-block properties are crucial while Lip(T)=1 is the critical condition; empirical results are said to corroborate the theory.

Significance. If the claimed derivations and applications of nonlinear Lipschitz operator theory hold with the required regularity conditions, the work could supply a rigorous mathematical framework linking emergent abilities and scaling laws to operator-theoretic limits, moving the field beyond purely empirical characterizations. The attempt to formalize the transition to infinite knowledge via a parameter-limit architecture and to connect it to practical finite architectures is a substantive direction, though its impact depends on filling in the missing technical details.

major comments (3)
  1. [Abstract] Abstract: The abstract asserts that tools from nonlinear Lipschitz operator theory prove the necessary and sufficient conditions for existence of the limit architecture, yet supplies no explicit construction of the operator T, the underlying complete metric space (e.g., Banach space of functions or operators), the norm, or verification that training dynamics of standard architectures satisfy uniform Lipschitz continuity or the condition Lip(T)=1.
  2. [Abstract] Abstract: The performance function E(N,P,K) is posited to admit a limit as N,P,K→∞ that equals the learning behavior of the limit architecture, but no topology, function space, or regularity assumptions (e.g., continuity or boundedness needed for the limit to exist and for the operator theory to apply) are stated; without these the claimed correspondence cannot be verified and the theory does not transfer to practical foundation models.
  3. [Abstract] Abstract / Theoretical results: The derivation of the scaling law via Lipschitz operator and covering number is listed as a result, but no intermediate steps, explicit use of the covering number, or how it produces the scaling form are provided, leaving the link between the operator-theoretic conditions and the scaling law unsupported.
minor comments (1)
  1. [Abstract] Abstract: grammatical error in the opening sentence ('Emergent intelligence have played') should read 'has played'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. The points raised highlight areas where the presentation of the technical details can be improved for clarity. We address each major comment below and will revise the manuscript accordingly to strengthen the exposition of the operator-theoretic framework.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that tools from nonlinear Lipschitz operator theory prove the necessary and sufficient conditions for existence of the limit architecture, yet supplies no explicit construction of the operator T, the underlying complete metric space (e.g., Banach space of functions or operators), the norm, or verification that training dynamics of standard architectures satisfy uniform Lipschitz continuity or the condition Lip(T)=1.

    Authors: We agree that the abstract, being concise, does not supply these explicit elements. In the revised manuscript we will add a new subsection that constructs T explicitly as the nonlinear operator induced by the infinite-horizon training dynamics acting on the space of performance functions. The underlying space is the Banach space of bounded continuous functions on the input domain equipped with the supremum norm. We will verify that, when each basic block satisfies a non-expansive property, the composite operator satisfies Lip(T) = 1 uniformly. A concrete verification for transformer attention blocks will be included to demonstrate applicability to standard architectures. revision: yes

  2. Referee: [Abstract] Abstract: The performance function E(N,P,K) is posited to admit a limit as N,P,K→∞ that equals the learning behavior of the limit architecture, but no topology, function space, or regularity assumptions (e.g., continuity or boundedness needed for the limit to exist and for the operator theory to apply) are stated; without these the claimed correspondence cannot be verified and the theory does not transfer to practical foundation models.

    Authors: The referee correctly identifies that the required regularity conditions are not stated in the abstract. We will revise the manuscript to specify that E(N,P,K) is continuous with respect to the product topology on the data, parameter, and iteration spaces and takes values in the bounded interval [0,1]. The limit is understood in the topology of uniform convergence. Under these conditions the correspondence between the finite-model performance and the limit architecture follows directly from the continuity of the performance map with respect to the operator norm. This clarification will make the transfer to practical models explicit. revision: yes

  3. Referee: [Abstract] Abstract / Theoretical results: The derivation of the scaling law via Lipschitz operator and covering number is listed as a result, but no intermediate steps, explicit use of the covering number, or how it produces the scaling form are provided, leaving the link between the operator-theoretic conditions and the scaling law unsupported.

    Authors: We acknowledge that the abstract omits the intermediate derivation steps. In the revision we will expand the theoretical results section to include a step-by-step argument: the Lipschitz condition Lip(T)=1 is used to bound the deviation between finite and limit performance; the covering number of the function class generated by the Lipschitz operator is then invoked to control the entropy integral; the resulting sample-complexity bound yields the observed power-law scaling in model size P. The explicit dependence on the covering number will be displayed, thereby connecting the operator-theoretic hypothesis directly to the scaling form. revision: yes
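
For the first response, a hypothetical numerical probe of the promised transformer check: empirically estimating a local Lipschitz constant of a randomly initialized single-head self-attention block. The weights, sizes and helper functions are illustrative stand-ins, not the authors' construction.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, Wq, Wk, Wv):
        """Single-head self-attention: softmax(X Wq (X Wk)^T / sqrt(d)) X Wv."""
        d = Wq.shape[1]
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
        return softmax(scores, axis=-1) @ (X @ Wv)

    rng = np.random.default_rng(0)
    n, d = 16, 32
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    X = rng.normal(size=(n, d))

    step = 1e-3
    ratios = []
    for _ in range(200):
        D = rng.normal(size=(n, d))
        D *= step / np.linalg.norm(D)
        diff = attention(X + D, Wq, Wk, Wv) - attention(X, Wq, Wk, Wv)
        ratios.append(np.linalg.norm(diff) / step)
    print(f"local Lipschitz estimate of the attention block near X: {max(ratios):.3f}")

Estimates persistently above 1 near typical inputs would mean the block is not non-expansive without additional normalization, which is exactly what such a verification would have to rule in or out.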
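
For the third response, the described chain can be written schematically (our paraphrase with unspecified constants, not a theorem from the paper):

$$\bigl|\mathcal{E}(N,P,K) - \mathcal{E}_\infty\bigr| \;\le\; \underbrace{\mathrm{Lip}(T)\,\delta(P,K)}_{\text{finite vs. limit operator}} \;+\; \underbrace{c\,\sqrt{\tfrac{\log \mathcal{N}(\mathcal{F}_T,\varepsilon)}{N}}}_{\text{covering-number term}} \;\lesssim\; a\,P^{-\alpha} + b\,N^{-\beta},$$

where $\mathcal{N}(\mathcal{F}_T,\varepsilon)$ is the covering number of the function class generated by the Lipschitz operator, and the exponents are set by how that covering number grows with $\varepsilon$ and $P$.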

Circularity Check

1 step flagged

Emergent intelligence is defined as the limit of E(N,P,K), then explained by a limit architecture whose existence conditions are proved equivalent by construction.

specific steps
  1. self-definitional [Abstract]
    "we posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit lim_{N,P,K → ∞} E(N,P,K), with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the nec"

    Emergent intelligence is defined directly as the existence of the limit of the performance function. The limit architecture is then introduced as the origin of, and rational correspondent to, that same limit. The subsequent proof of necessary and sufficient conditions for the limit architecture therefore characterizes the conditions under which the defined limit exists, rendering the explanatory claim equivalent to the definitional step by construction.

full rationale

The paper's central derivation begins by explicitly recasting emergent intelligence as the mathematical existence of lim N,P,K→∞ E(N,P,K). It then posits that this same phenomenon 'originates from' and 'rationally corresponds to' a limit architecture, whose necessary and sufficient existence conditions are derived via nonlinear Lipschitz operator theory. Because the limit architecture is introduced solely to account for the behavior already defined as the limit, and no independent characterization of the architecture (separate from the limit of E) is supplied, the claimed explanation reduces to the initial definition. The Lipschitz conditions characterize when the defined limit exists but do not furnish an external grounding that separates the architecture from the performance limit it is said to produce. This matches the self-definitional pattern without requiring external benchmarks or fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a performance function E(N,P,K) admits a well-defined limit whose existence can be characterized by Lipschitz conditions on an unspecified training operator; the limit architecture is postulated without independent evidence outside the definition of the limit itself.

axioms (1)
  • domain assumption: The performance function E(N, P, K) is well-defined for foundation models and the limit as N, P, K → ∞ exists under suitable conditions on the architecture.
    This is the foundational posit that allows recasting emergence as a limit existence question.
invented entities (1)
  • limit architecture: no independent evidence
    purpose: The architecture realized in the infinite-parameter limit that governs emergent intelligence and scaling behavior.
    Introduced to explain the origin of the limit without external validation or falsifiable prediction beyond the limit definition itself.

pith-pipeline@v0.9.0 · 5608 in / 1537 out tokens · 72813 ms · 2026-05-13T07:22:18.447116+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes (a Lean-style sketch of the quoted conditions follows this list)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the necessary and sufficient conditions for the existence of the limit architecture are (i) Lip(Ti)≤1 for all i≥K0 ... and (ii) there exists a non-expansive operator T such that ∥Ti−T∥≤ϵi with ∑ϵi<∞

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction echoes

    Lip(T)=1 thus is a critical case at which a very complex dynamics might occur ... the emergent intelligence most likely comes from the critical setting Lip(T)=1
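
The two echoed passages quote the paper's conditions (i) and (ii). A hypothetical Lean-style rendering of that hypothesis package, written against Mathlib's LipschitzWith; the structure and field names are ours, and this is neither the linked Recognition theorems nor the paper's own formalization:

    import Mathlib

    /-- Sketch of conditions (i)-(ii): basic blocks `T i` are non-expansive past some
    index `K0`, and they approximate a non-expansive limit operator `Tlim` with
    summable deviations `eps i` (stated here in pointwise rather than norm form). -/
    structure LimitArchitectureHypotheses {X : Type*} [MetricSpace X]
        (T : ℕ → X → X) (Tlim : X → X) where
      K0 : ℕ
      nonexpansive : ∀ i, K0 ≤ i → LipschitzWith 1 (T i)   -- condition (i): Lip(T i) ≤ 1
      nonexpansiveLim : LipschitzWith 1 Tlim               -- condition (ii): limit operator is non-expansive
      eps : ℕ → ℝ
      eps_nonneg : ∀ i, 0 ≤ eps i
      approx : ∀ i x, dist (T i x) (Tlim x) ≤ eps i        -- deviation of T i from Tlim is at most eps i
      eps_summable : Summable eps                          -- the eps i sum to a finite value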

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
  • echoes: the paper passage shares the mathematical shape or conceptual pattern of a theorem in the canon without being a direct formal dependency.

Reference graph

Works this paper leans on

159 extracted references · 159 canonical work pages · 22 internal anchors
