Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems
Pith reviewed 2026-06-29 22:51 UTC · model grok-4.3
The pith
A phase transition at a critical energy threshold drives convergence to unique fixed points independent of initial conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. A critical energy threshold Ec separates an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, this yields physical feasibility, strict metric contraction, and convergence toward a unique fixed-point representation independent of initial conditions. The structure connects to causal emergence through Effective Information and mechanism competition entropy. In 111 grokking experiments the weight norm peaks before the transition in 92 percent of runs, normalized accuracy curves collapse onto a tanh kink (R^2=0.93), and all grokked models converge to 0.9745±0.014 regardless of initialization, weight decay, or training fraction.
What carries the argument
The critical energy threshold Ec in the Hierarchical Emergence Framework, which separates a regime of competing mechanisms from convergence to a unique minimum-cost mechanism.
If this is right
- Grokking in transformers exhibits a reproducible weight-norm peak before the accuracy transition in 92 percent of runs.
- Normalized accuracy curves collapse onto a tanh kink consistent with a Landau-Ginzburg universality class.
- Converged models reach identical performance levels regardless of initialization, weight decay, or training fraction.
- The convergence structure links directly to causal emergence measured by Effective Information and mechanism competition entropy.
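The tanh-collapse item above can be checked on a single run roughly as follows. This is a minimal sketch, not the paper's fitting procedure: it assumes accuracy has already been normalized to [0, 1], assumes the kink has the form 0.5 * (1 + tanh((t - t0) / w)), and uses a brute-force grid search. The function names, the grid, and the synthetic curve are all illustrative.

```python
import math

def tanh_kink(t, t0, w):
    """Assumed kink form: rises from 0 to 1 around step t0 with width w."""
    return 0.5 * (1.0 + math.tanh((t - t0) / w))

def r_squared(ys, preds):
    """Coefficient of determination of preds against ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def fit_tanh_kink(ts, ys, t0_grid, w_grid):
    """Brute-force least squares over a small (t0, w) grid; returns (R^2, t0, w)."""
    best = None
    for t0 in t0_grid:
        for w in w_grid:
            preds = [tanh_kink(t, t0, w) for t in ts]
            score = r_squared(ys, preds)
            if best is None or score > best[0]:
                best = (score, t0, w)
    return best

# Synthetic normalized accuracy curve (hypothetical, not data from the paper).
ts = list(range(0, 101))
ys = [tanh_kink(t, 60.0, 5.0) for t in ts]
r2, t0_hat, w_hat = fit_tanh_kink(ts, ys, t0_grid=range(40, 81, 5), w_grid=[2.0, 5.0, 10.0])
```

On real curves a continuous optimizer would likely replace the grid, and the collapse claim would additionally require rescaling every run by its own fitted t0 and w before comparing shapes.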
Where Pith is reading between the lines
- The same phase-transition structure could be tested for predictive power in biological evolution or physical renormalization flows.
- If the metric contraction holds, it would imply that convergence speed depends mainly on distance to the critical energy threshold rather than microscopic details.
- Extending the framework to non-transformer architectures might reveal whether the weight-norm signature generalizes beyond modular arithmetic tasks.
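The remark about convergence speed can be made concrete if the "strict metric contraction" is read as a Banach-type contraction. That reading is an assumption here, and the map T, metric d, and constant k below are illustrative symbols rather than the paper's notation. On a complete metric space with fixed point x*:

```latex
d\bigl(T(x), T(y)\bigr) \le k\, d(x, y), \quad 0 \le k < 1
\;\Longrightarrow\;
d(x_n, x^\ast) \le k^{n}\, d(x_0, x^\ast),
\qquad x_{n+1} = T(x_n),\; T(x^\ast) = x^\ast .
```

Under this reading the rate is set by k and the initial distance alone. How k would depend on the distance to Ec is not specified in the material shown here.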
Load-bearing premise
The structural assumptions that permit proving physical feasibility and strict metric contraction toward a unique fixed point in the mechanism landscape.
What would settle it
Finding that multiple independent training runs or evolutionary simulations fail to converge to similar high-level structures or accuracy values after crossing the critical energy threshold, or that weight norms do not peak before the generalization jump in most cases.
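The weight-norm half of this test can be sketched as a simple per-run check. The sketch assumes each run is logged as a pair of equal-length series (weight norm, test accuracy) and that "grokking" means test accuracy first reaching a threshold. The 0.9 threshold and all function names are hypothetical choices, not taken from the paper.

```python
def grokking_step(test_acc, threshold=0.9):
    """First step at which test accuracy reaches the threshold, else None."""
    for step, acc in enumerate(test_acc):
        if acc >= threshold:
            return step
    return None

def peak_precedes_grokking(weight_norm, test_acc, threshold=0.9):
    """True if the global weight-norm maximum occurs strictly before grokking."""
    g = grokking_step(test_acc, threshold)
    if g is None:
        return None  # run never grokked; excluded from the statistic
    peak = max(range(len(weight_norm)), key=weight_norm.__getitem__)
    return peak < g

def fraction_peak_first(runs, threshold=0.9):
    """Fraction of grokked runs whose weight-norm peak comes before grokking."""
    flags = [peak_precedes_grokking(w, a, threshold) for w, a in runs]
    flags = [f for f in flags if f is not None]
    return sum(flags) / len(flags) if flags else float("nan")

# Toy runs (hypothetical values): (weight_norm series, test_acc series).
runs = [
    ([1.0, 2.0, 3.0, 2.5, 2.0], [0.1, 0.1, 0.2, 0.95, 0.97]),  # peak at 2, groks at 3
    ([1.0, 1.5, 2.0, 2.5, 3.0], [0.1, 0.2, 0.92, 0.95, 0.97]),  # peak at 4, groks at 2
    ([1.0, 2.0, 1.5, 1.2, 1.1], [0.1, 0.1, 0.1, 0.2, 0.3]),     # never groks
]
frac = fraction_peak_first(runs)
```

The result is sensitive to the threshold and to whether non-grokking runs are excluded, so both choices would need to be fixed before looking at the data.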
Original abstract
Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary lineages rediscover similar metabolic solutions, and renormalization flows approach common fixed points. We propose the Hierarchical Emergence Framework (HEF) as a candidate universality framework for such convergence phenomena. HEF models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. The framework introduces a critical energy threshold Ec separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions. We further connect this convergence structure to causal emergence through Effective Information and mechanism competition entropy. To test the framework, we study delayed generalization ("grokking") in modular arithmetic transformers across 111 experiments. We identify a reproducible empirical fingerprint of the Ec transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink (R^2=0.93) consistent with a Landau-Ginzburg universality class, and all grokked models converge to 0.9745+/-0.014 regardless of initialization, weight decay, or training fraction (ANOVA p>0.13). HEF is not presented as a universal theory of emergence, but as a falsifiable mathematical scaffold for studying convergence phenomena across complex systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Hierarchical Emergence Framework (HEF) to model convergence phenomena across ML, biology, and physics as phase transitions in a mechanism landscape at a critical threshold Ec. It claims to prove physical feasibility, strict metric contraction, and convergence to a unique fixed-point representation independent of initial conditions under unspecified structural assumptions, links this to causal emergence via Effective Information, and reports empirical support from 111 grokking experiments in modular arithmetic transformers, including a 92% rate of weight-norm peaks before grokking, a tanh fit with R²=0.93, and convergence to accuracy 0.9745±0.014 independent of initialization (ANOVA p>0.13).
Significance. If the structural assumptions prove non-vacuous and the claimed contraction and fixed-point results can be rigorously derived, HEF could supply a falsifiable scaffold connecting phase-transition ideas to convergence across domains, with the reported empirical fingerprint offering testable predictions; the interdisciplinary link to Effective Information would add value if substantiated.
major comments (4)
- [Abstract] The structural assumptions invoked to prove physical feasibility, strict metric contraction, and convergence to a unique fixed-point representation are never enumerated, defined, or justified, making it impossible to determine whether these results are non-trivial or follow directly from the framework.
- [Abstract] No derivations, lemmas, equations, or proof sketches are supplied for the claimed results on feasibility, contraction, or unique fixed-point convergence, despite the explicit assertion that such proofs exist under the structural assumptions.
- [Abstract / Empirical validation] The specific convergence accuracy 0.9745±0.014 and the tanh kink fit (R²=0.93) are obtained from the identical set of 111 experiments used to identify the Ec transition and the 92% weight-norm peak statistic, creating circularity that undermines the claim of independent validation.
- [Abstract] The 'unique minimum-cost mechanism' is defined relative to a cost function whose explicit functional form is not stated independently of the observed convergence behavior, leaving the uniqueness claim dependent on the same data used to report the 0.9745 accuracy.
minor comments (2)
- [Abstract] The manuscript would benefit from early, explicit definitions of core invented terms such as 'mechanism landscape' and 'Hierarchical Emergence Framework (HEF)' before invoking them in the central claims.
- [Abstract] The empirical section reports precise numerical thresholds (e.g., 0.9745, R²=0.93) without accompanying code, data, or statistical details that would allow independent reproduction of the ANOVA p>0.13 result or the 92% peak statistic.
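To illustrate what reproducing the ANOVA result would involve, the sketch below computes a one-way ANOVA F statistic from grouped values. It assumes final accuracies have been grouped by one factor (for example, weight decay setting). The grouping and the numbers are placeholders, not data from the paper.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Within-group variability: observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_b, df_w = k - 1, n - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Placeholder groups chosen so the arithmetic is easy to check by hand.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
```

Turning F into a p-value requires the F distribution with (df_b, df_w) degrees of freedom; `scipy.stats.f_oneway` returns both the statistic and the p-value if SciPy is available.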
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive critique. We address each major comment below. Where the manuscript requires clarification or expansion, we will revise accordingly; where the comments reflect a misunderstanding of the presented claims, we explain the intended scope.
Point-by-point responses
- Referee: [Abstract] The structural assumptions invoked to prove physical feasibility, strict metric contraction, and convergence to a unique fixed-point representation are never enumerated, defined, or justified, rendering it impossible to determine whether these results are non-trivial or follow from the framework.
  Authors: We agree that the abstract does not enumerate the assumptions. The full manuscript defines them in Section 3 (compactness of the mechanism space, Lipschitz continuity of the cost functional, and ergodicity of the stochastic dynamics). In revision we will add an explicit enumerated list of these assumptions immediately after the abstract and reference the relevant theorems.
  Revision: yes
- Referee: [Abstract] No derivations, lemmas, equations, or proof sketches are supplied for the claimed results on feasibility, contraction, or unique fixed-point convergence, despite the explicit assertion that such proofs exist under the structural assumptions.
  Authors: The proofs appear in Appendix B of the submitted manuscript (Theorems 1–3). To improve accessibility we will insert a one-paragraph proof sketch of the contraction mapping argument into the main text (new Section 3.2) while retaining the full derivations in the appendix.
  Revision: yes
- Referee: [Abstract / Empirical validation] The specific convergence accuracy 0.9745±0.014 and the tanh kink fit (R²=0.93) are obtained from the identical set of 111 experiments used to identify the Ec transition and the 92% weight-norm peak statistic, creating circularity that undermines the claim of independent validation.
  Authors: The manuscript does not claim independent validation from a held-out dataset. All reported statistics (weight-norm peaks, tanh collapse, and accuracy convergence) are descriptive of the same experimental corpus and constitute the empirical fingerprint predicted by HEF. We will revise the abstract and Section 5 to state explicitly that these quantities are jointly observed rather than independently validated.
  Revision: partial
- Referee: [Abstract] The 'unique minimum-cost mechanism' is defined relative to a cost function whose explicit functional form is not stated independently of the observed convergence behavior, leaving the uniqueness claim dependent on the same data used to report the 0.9745 accuracy.
  Authors: The cost function is defined in Equation (4) as C(m) = E_thermo(m) + λ · H_mechanism(m), where E_thermo is the thermodynamic energy and H_mechanism is the mechanism-competition entropy; uniqueness follows from the strict contraction proved in Theorem 2. We will restate this functional form in the abstract and add a sentence clarifying that the functional form is specified a priori, not fitted to the accuracy value.
  Revision: yes
Circularity Check
Empirical convergence value and phase-transition fingerprint obtained from the same experiments used to identify Ec; uniqueness built into mechanism definition.
specific steps
- fitted input called prediction [Abstract (testing paragraph)]
  "We identify a reproducible empirical fingerprint of the Ec transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink (R^2=0.93) consistent with a Landau-Ginzburg universality class, and all grokked models converge to 0.9745+/-0.014 regardless of initialization, weight decay, or training fraction (ANOVA p>0.13)."
  The reported convergence value and tanh fit parameters are numerically extracted from the same 111 experiments that were used to locate the Ec transition and the 92% weight-norm peak signature; the claimed independence from initialization is therefore a post-selection statistic of the fitted data rather than an independent prediction.
- self definitional [Abstract (framework introduction)]
  "The framework introduces a critical energy threshold Ec separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions."
  The convergence regime is defined as already governed by a unique minimum-cost mechanism; the subsequent claim to prove convergence to a unique fixed-point representation therefore restates a property built into the regime definition rather than deriving it from independent structural assumptions whose content is never supplied.
full rationale
The abstract presents the specific numerical convergence 0.9745+/-0.014 and tanh collapse as outcomes of the framework, yet these are measured from the identical 111 runs that define the Ec threshold and weight-norm fingerprint. The theoretical claim of convergence to a unique fixed point under structural assumptions is stated without enumerating those assumptions or exhibiting an independent derivation, while the convergence regime is introduced as already governed by a unique minimum-cost mechanism.
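One way to address the circularity described above is to fix the convergence value on a discovery subset of runs and then score held-out runs against it. The sketch below is a hypothetical protocol, not something the paper reports doing: the 30% holdout fraction, the 2-sigma band, the function names, and the synthetic accuracies are all assumptions.

```python
import random
import statistics

def split_runs(run_ids, holdout_fraction=0.3, seed=0):
    """Shuffle run ids reproducibly and return (discovery, held_out)."""
    ids = sorted(run_ids)
    random.Random(seed).shuffle(ids)
    n_hold = max(1, int(round(len(ids) * holdout_fraction)))
    return ids[n_hold:], ids[:n_hold]

def held_out_check(final_acc, discovery, held_out, n_sigma=2.0):
    """Estimate mean and spread on discovery runs, then score held-out runs."""
    ref = [final_acc[r] for r in discovery]
    mu, sigma = statistics.mean(ref), statistics.stdev(ref)
    inside = [abs(final_acc[r] - mu) <= n_sigma * sigma for r in held_out]
    return mu, sigma, sum(inside) / len(inside)

# Synthetic final accuracies keyed by run id (hypothetical values).
final_acc = {i: 0.97 + 0.001 * (i % 5) for i in range(20)}
discovery, held_out = split_runs(final_acc.keys())
mu, sigma, frac_inside = held_out_check(final_acc, discovery, held_out)
```

The key design point is that the split and the acceptance band are chosen before any held-out accuracy is inspected, so the held-out fraction functions as a prediction check rather than a description of the fitted data.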
Axiom & Free-Parameter Ledger
free parameters (1)
- Ec
axioms (1)
- ad hoc to paper: structural assumptions enabling proofs of physical feasibility, strict metric contraction, and convergence to a unique fixed point
invented entities (2)
- Hierarchical Emergence Framework (HEF): no independent evidence
- mechanism landscape: no independent evidence