pith. machine review for the scientific record.

arxiv: 2604.24037 · v3 · submitted 2026-04-27 · 💻 cs.LG · math.ST · stat.TH

Recognition: 2 theorem links

· Lean Theorem

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:22 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.TH
keywords emergent intelligence · foundation models · scaling laws · limit theory · Lipschitz operators · performance function · model architecture · nonlinear operator theory

The pith

Emergent intelligence in foundation models arises from the existence of a parameter-limit architecture as model size, data size and training steps approach infinity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a performance function E(N, P, K) depending on data size N, model size P and training steps K. Emergent intelligence is defined as the limit of this function when all three arguments tend to infinity. The authors prove using nonlinear Lipschitz operator theory that this limit exists if and only if a parameter-limit architecture is present. They also derive scaling laws and identify the condition Lip(T)=1 as critical for emergence. Readers would care because the framework replaces empirical description with a mathematical criterion for when scaling produces new capabilities.

Core claim

Emergent intelligence is recast as the existence of the limit lim N,P,K→∞ E(N,P,K). This limit is produced by a parameter-limit architecture whose learning behavior matches the observed emergence. Nonlinear Lipschitz operator theory supplies the necessary and sufficient conditions for the architecture to exist. Scaling laws are derived via Lipschitz operators and covering numbers. The results state that emergence depends on training steps, data size and the properties of basic blocks in the architecture, with the condition Lip(T)=1 serving as the critical threshold.
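
Stated compactly in the paper's notation (a condensed restatement of the abstract and of the conditions quoted in the Lean section below, not a new result):

$$\text{emergent intelligence} \;:\Longleftrightarrow\; \lim_{N,P,K \to \infty} \mathcal{E}(N,P,K) \ \text{exists},$$

$$\text{limit architecture exists} \;\Longleftrightarrow\; \text{(i)}\ \mathrm{Lip}(T_i) \le 1 \ \text{for all } i \ge K_0 \ \ \text{and}\ \ \text{(ii)}\ \exists\, T \ \text{non-expansive with}\ \|T_i - T\| \le \epsilon_i,\ \textstyle\sum_i \epsilon_i < \infty,$$

with $\mathrm{Lip}(T) = 1$ singled out as the critical case associated with emergence.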

What carries the argument

The parameter-limit architecture: the infinite-dimensional system whose existence is necessary and sufficient for the performance limit lim N,P,K→∞ E(N,P,K) to exist, and whose learning dynamics produce emergent abilities.

If this is right

  • Emergent intelligence is governed by training steps, data size and model architecture, with the properties of basic blocks playing a decisive role.
  • The critical condition Lip(T)=1 provides theoretical support for existing empirical findings on when emergence occurs.
  • Emergent intelligence is determined by an infinite-dimensional system yet can be realized through finite-dimensional architectures.
  • Scaling laws for foundation models are obtained directly from Lipschitz operator and covering number arguments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures could be deliberately designed to satisfy Lip(T)=1 in order to promote controlled emergence.
  • Finite models may already serve as effective approximations to the limit architecture, allowing scaling benefits to continue without literal infinity.
  • The same limit framework might extend to other scaling behaviors observed in machine learning systems.

Load-bearing premise

The performance function E(N, P, K) is sufficiently well-behaved that the triple limit as N, P and K approach infinity exists, and nonlinear Lipschitz operator theory applies directly to the training dynamics without further regularity assumptions.
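
For orientation, one standard sufficient condition of this type (our illustration of the kind of regularity involved, not an assumption stated in the abstract): if $\mathcal{E}$ is bounded and nondecreasing in each argument, the triple limit exists and equals the supremum,

$$\mathcal{E} \le C \ \text{ and } \ \mathcal{E}\ \text{nondecreasing in each of } N, P, K \;\Longrightarrow\; \lim_{N,P,K \to \infty} \mathcal{E}(N,P,K) = \sup_{N,P,K} \mathcal{E}(N,P,K).$$

Without monotonicity or some comparable control, the joint limit can fail to exist even when each single-variable limit does.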

What would settle it

A scaling experiment that increases N, P and K while the observed Lip(T) value stays away from 1 yet new emergent abilities still appear, or that increases all three variables while the limit of E fails to exist and no new abilities appear.
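
A minimal sketch of how such a test could be instrumented, using a toy stand-in for the training operator and for the performance function (the helper names, the random linear operator and the synthetic performance curve below are illustrative assumptions, not the paper's setup):

    import numpy as np

    def estimate_lipschitz(step_fn, states, rng, eps=1e-3):
        """Crude local Lipschitz estimate of a training-step operator: the largest
        ratio ||step(x + d) - step(x)|| / ||d|| over small random perturbations d."""
        ratios = []
        for x in states:
            d = rng.normal(size=x.shape)
            d *= eps / (np.linalg.norm(d) + 1e-12)
            ratios.append(np.linalg.norm(step_fn(x + d) - step_fn(x)) / eps)
        return max(ratios)

    def run_scaling_point(P, K, rng):
        """Toy stand-in for one (P, K) scaling point: a random linear operator plays
        the role of the training operator T, and a saturating curve stands in for E."""
        A = rng.normal(size=(P, P)) / np.sqrt(P)     # random operator, Lip roughly O(1)
        step_fn = lambda x: A @ x
        states = [rng.normal(size=P) for _ in range(8)]
        lip = estimate_lipschitz(step_fn, states, rng)
        e_hat = 1.0 - np.exp(-K / (50.0 * P))        # placeholder performance value
        return lip, e_hat

    rng = np.random.default_rng(0)
    for P, K in [(32, 1_000), (128, 10_000), (512, 100_000)]:
        lip, e_hat = run_scaling_point(P, K, rng)
        print(f"P={P:4d}  K={K:7d}  Lip(T) estimate={lip:.2f}  E estimate={e_hat:.3f}")

In a real run, step_fn would be the model's actual training-step operator and the performance value would come from a benchmark suite; the sketch only shows the bookkeeping of tracking a Lipschitz estimate alongside E across scaling points.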

read the original abstract

Emergent intelligence have played a major role in the modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function E(N, P, K), dependent on data size N, model size P and training steps K, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the necessary and sufficient conditions for existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools of Lipschitz operator and covering number. Theoretical results show that: 1) emergent intelligence is governed by three key factors-training steps, data size and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings. 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper develops a limit theory for foundation models by introducing a performance function E(N,P,K) that quantifies intelligence behavior in terms of data size N, model size P, and training steps K. It recasts emergent intelligence as the existence of the limit lim N,P,K→∞ E(N,P,K), attributes this to the existence of a 'limit architecture' whose necessary and sufficient conditions are proved via nonlinear Lipschitz operator theory, derives scaling laws using Lipschitz operators and covering numbers, and states that basic-block properties are crucial while Lip(T)=1 is the critical condition; empirical results are said to corroborate the theory.

Significance. If the claimed derivations and applications of nonlinear Lipschitz operator theory hold with the required regularity conditions, the work could supply a rigorous mathematical framework linking emergent abilities and scaling laws to operator-theoretic limits, moving the field beyond purely empirical characterizations. The attempt to formalize the transition to infinite knowledge via a parameter-limit architecture and to connect it to practical finite architectures is a substantive direction, though its impact depends on filling in the missing technical details.

major comments (3)
  1. [Abstract] Abstract: The abstract asserts that tools from nonlinear Lipschitz operator theory prove the necessary and sufficient conditions for existence of the limit architecture, yet supplies no explicit construction of the operator T, the underlying complete metric space (e.g., Banach space of functions or operators), the norm, or verification that training dynamics of standard architectures satisfy uniform Lipschitz continuity or the condition Lip(T)=1.
  2. [Abstract] Abstract: The performance function E(N,P,K) is posited to admit a limit as N,P,K→∞ that equals the learning behavior of the limit architecture, but no topology, function space, or regularity assumptions (e.g., continuity or boundedness needed for the limit to exist and for the operator theory to apply) are stated; without these the claimed correspondence cannot be verified and the theory does not transfer to practical foundation models.
  3. [Abstract] Abstract / Theoretical results: The derivation of the scaling law via Lipschitz operator and covering number is listed as a result, but no intermediate steps, explicit use of the covering number, or how it produces the scaling form are provided, leaving the link between the operator-theoretic conditions and the scaling law unsupported.
minor comments (1)
  1. [Abstract] Abstract: grammatical error in the opening sentence ('Emergent intelligence have played') should read 'has played'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. The points raised highlight areas where the presentation of the technical details can be improved for clarity. We address each major comment below and will revise the manuscript accordingly to strengthen the exposition of the operator-theoretic framework.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that tools from nonlinear Lipschitz operator theory prove the necessary and sufficient conditions for existence of the limit architecture, yet supplies no explicit construction of the operator T, the underlying complete metric space (e.g., Banach space of functions or operators), the norm, or verification that training dynamics of standard architectures satisfy uniform Lipschitz continuity or the condition Lip(T)=1.

    Authors: We agree that the abstract, being concise, does not supply these explicit elements. In the revised manuscript we will add a new subsection that constructs T explicitly as the nonlinear operator induced by the infinite-horizon training dynamics acting on the space of performance functions. The underlying space is the Banach space of bounded continuous functions on the input domain equipped with the supremum norm. We will verify that, when each basic block satisfies a non-expansive property, the composite operator satisfies Lip(T) = 1 uniformly. A concrete verification for transformer attention blocks will be included to demonstrate applicability to standard architectures. revision: yes

  2. Referee: [Abstract] Abstract: The performance function E(N,P,K) is posited to admit a limit as N,P,K→∞ that equals the learning behavior of the limit architecture, but no topology, function space, or regularity assumptions (e.g., continuity or boundedness needed for the limit to exist and for the operator theory to apply) are stated; without these the claimed correspondence cannot be verified and the theory does not transfer to practical foundation models.

    Authors: The referee correctly identifies that the required regularity conditions are not stated in the abstract. We will revise the manuscript to specify that E(N,P,K) is continuous with respect to the product topology on the data, parameter, and iteration spaces and takes values in the bounded interval [0,1]. The limit is understood in the topology of uniform convergence. Under these conditions the correspondence between the finite-model performance and the limit architecture follows directly from the continuity of the performance map with respect to the operator norm. This clarification will make the transfer to practical models explicit. revision: yes

  3. Referee: [Abstract] Abstract / Theoretical results: The derivation of the scaling law via Lipschitz operator and covering number is listed as a result, but no intermediate steps, explicit use of the covering number, or how it produces the scaling form are provided, leaving the link between the operator-theoretic conditions and the scaling law unsupported.

    Authors: We acknowledge that the abstract omits the intermediate derivation steps. In the revision we will expand the theoretical results section to include a step-by-step argument: the Lipschitz condition Lip(T)=1 is used to bound the deviation between finite and limit performance; the covering number of the function class generated by the Lipschitz operator is then invoked to control the entropy integral; the resulting sample-complexity bound yields the observed power-law scaling in model size P. The explicit dependence on the covering number will be displayed, thereby connecting the operator-theoretic hypothesis directly to the scaling form. revision: yes
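
For the first response, a hypothetical numerical probe of the promised transformer check: empirically estimating a local Lipschitz constant of a randomly initialized single-head self-attention block. The weights, sizes and helper functions are illustrative stand-ins, not the authors' construction.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, Wq, Wk, Wv):
        """Single-head self-attention: softmax(X Wq (X Wk)^T / sqrt(d)) X Wv."""
        d = Wq.shape[1]
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
        return softmax(scores, axis=-1) @ (X @ Wv)

    rng = np.random.default_rng(0)
    n, d = 16, 32
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    X = rng.normal(size=(n, d))

    step = 1e-3
    ratios = []
    for _ in range(200):
        D = rng.normal(size=(n, d))
        D *= step / np.linalg.norm(D)
        diff = attention(X + D, Wq, Wk, Wv) - attention(X, Wq, Wk, Wv)
        ratios.append(np.linalg.norm(diff) / step)
    print(f"local Lipschitz estimate of the attention block near X: {max(ratios):.3f}")

Estimates persistently above 1 near typical inputs would mean the block is not non-expansive without additional normalization, which is exactly what such a verification would have to rule in or out.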
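
For the third response, the described chain can be written schematically (our paraphrase with unspecified constants, not a theorem from the paper):

$$\bigl|\mathcal{E}(N,P,K) - \mathcal{E}_\infty\bigr| \;\le\; \underbrace{\mathrm{Lip}(T)\,\delta(P,K)}_{\text{finite vs. limit operator}} \;+\; \underbrace{c\,\sqrt{\tfrac{\log \mathcal{N}(\mathcal{F}_T,\varepsilon)}{N}}}_{\text{covering-number term}} \;\lesssim\; a\,P^{-\alpha} + b\,N^{-\beta},$$

where $\mathcal{N}(\mathcal{F}_T,\varepsilon)$ is the covering number of the function class generated by the Lipschitz operator, and the exponents are set by how that covering number grows with $\varepsilon$ and $P$.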

Circularity Check

1 step flagged

Emergent intelligence is defined as the limit of E(N,P,K), then explained by a limit architecture whose existence conditions are proved equivalent by construction.

specific steps
  1. self-definitional [Abstract]
    "we posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit lim_{N,P,K → ∞} E(N,P,K), with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the nec"

    Emergent intelligence is defined directly as the existence of the limit of the performance function. The limit architecture is then introduced as the origin of, and rational correspondent to, that same limit. The subsequent proof of necessary and sufficient conditions for the limit architecture therefore characterizes the conditions under which the defined limit exists, rendering the explanatory claim equivalent to the definitional step by construction.

full rationale

The paper's central derivation begins by explicitly recasting emergent intelligence as the mathematical existence of lim N,P,K→∞ E(N,P,K). It then posits that this same phenomenon 'originates from' and 'rationally corresponds to' a limit architecture, whose necessary and sufficient existence conditions are derived via nonlinear Lipschitz operator theory. Because the limit architecture is introduced solely to account for the behavior already defined as the limit, and no independent characterization of the architecture (separate from the limit of E) is supplied, the claimed explanation reduces to the initial definition. The Lipschitz conditions characterize when the defined limit exists but do not furnish an external grounding that separates the architecture from the performance limit it is said to produce. This matches the self-definitional pattern without requiring external benchmarks or fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a performance function E(N,P,K) admits a well-defined limit whose existence can be characterized by Lipschitz conditions on an unspecified training operator; the limit architecture is postulated without independent evidence outside the definition of the limit itself.

axioms (1)
  • domain assumption: The performance function E(N, P, K) is well-defined for foundation models and the limit as N, P, K → ∞ exists under suitable conditions on the architecture.
    This is the foundational posit that allows recasting emergence as a limit existence question.
invented entities (1)
  • limit architecture: no independent evidence
    purpose: The architecture realized in the infinite-parameter limit that governs emergent intelligence and scaling behavior.
    Introduced to explain the origin of the limit without external validation or falsifiable prediction beyond the limit definition itself.

pith-pipeline@v0.9.0 · 5608 in / 1537 out tokens · 72813 ms · 2026-05-13T07:22:18.447116+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes (a Lean-style sketch of the quoted conditions follows this list)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the necessary and sufficient conditions for the existence of the limit architecture are (i) Lip(Ti)≤1 for all i≥K0 ... and (ii) there exists a non-expansive operator T such that ∥Ti−T∥≤ϵi with ∑ϵi<∞

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction echoes

    Lip(T)=1 thus is a critical case at which a very complex dynamics might occur ... the emergent intelligence most likely comes from the critical setting Lip(T)=1
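
The two echoed passages quote the paper's conditions (i) and (ii). A hypothetical Lean-style rendering of that hypothesis package, written against Mathlib's LipschitzWith; the structure and field names are ours, and this is neither the linked Recognition theorems nor the paper's own formalization:

    import Mathlib

    /-- Sketch of conditions (i)-(ii): basic blocks `T i` are non-expansive past some
    index `K0`, and they approximate a non-expansive limit operator `Tlim` with
    summable deviations `eps i` (stated here in pointwise rather than norm form). -/
    structure LimitArchitectureHypotheses {X : Type*} [MetricSpace X]
        (T : ℕ → X → X) (Tlim : X → X) where
      K0 : ℕ
      nonexpansive : ∀ i, K0 ≤ i → LipschitzWith 1 (T i)   -- condition (i): Lip(T i) ≤ 1
      nonexpansiveLim : LipschitzWith 1 Tlim               -- condition (ii): limit operator is non-expansive
      eps : ℕ → ℝ
      eps_nonneg : ∀ i, 0 ≤ eps i
      approx : ∀ i x, dist (T i x) (Tlim x) ≤ eps i        -- deviation of T i from Tlim is at most eps i
      eps_summable : Summable eps                          -- the eps i sum to a finite value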

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
  • echoes: the paper passage shares the mathematical shape or conceptual pattern of a theorem in the canon without being a direct formal dependency.

Reference graph

Works this paper leans on

159 extracted references · 159 canonical work pages · 22 internal anchors
