pith. machine review for the scientific record.

arxiv: 2604.28021 · v1 · submitted 2026-04-30 · ⚛️ physics.soc-ph · cs.CL

Recognition: unknown

Universal statistical laws governing culinary design


Pith reviewed 2026-05-07 04:52 UTC · model grok-4.3

classification: physics.soc-ph, cs.CL
keywords: culinary design, statistical laws, Zipf's law, Heaps' law, Menzerath-Altmann law, generative models, recipes, symbolic systems

The pith

Recipes follow Zipf-like scaling and other universal patterns found in languages, emerging from simple reuse and modification rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines a corpus of 118,083 traditional recipes from 26 regional cuisines to determine whether they exhibit statistical regularities similar to those observed in human languages and other complex systems. The authors identify several scaling laws: ingredient frequencies follow a Zipf-like distribution, the variety of ingredients grows sublinearly with corpus size according to Heaps' law, recipe complexity obeys Menzerath-Altmann-type relations, and macronutrient concentrations follow a log-normal distribution. These regularities are reproduced by minimal generative models relying on preferential reuse of popular ingredients, constrained choices, and incremental changes to existing recipes. A sympathetic reader would care because this implies that culinary creativity is not arbitrary but constrained by universal processes that shape how humans combine elements in symbolic domains, potentially unifying our understanding of culture, language, and design.

Core claim

The central discovery is that recipes form a compositional symbolic system governed by universal statistical laws, including Zipf-like rank-frequency scaling for ingredients, sublinear growth of culinary diversity per Heaps' law, Menzerath-Altmann-type relations for complexity, and log-normal distributions for macronutrients. These patterns arise from generic generative processes involving preferential reuse, constrained sampling, and incremental modification, which suffice to recapitulate the observed structures across cultures without needing culture-specific rules.

What carries the argument

Minimal generative models based on preferential reuse of ingredients, constrained sampling, and incremental modification of recipes, which generate the observed statistical regularities.
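To make the mechanism concrete, here is a minimal sketch of a preferential-reuse process in the spirit of a Simon-style model. It is an illustration under stated assumptions, not the authors' model: the function names and the parameters `p_new` and `recipe_size` are hypothetical, and the incremental-modification step is left out.

```python
import random
from collections import Counter


def simulate_recipes(n_recipes=2000, recipe_size=8, p_new=0.05, seed=0):
    """Toy preferential-reuse process (assumption: not the paper's exact model).

    Each ingredient slot is filled with a brand-new ingredient with
    probability p_new; otherwise it copies a uniformly chosen past usage,
    so ingredients are reused in proportion to how often they were used.
    """
    rng = random.Random(seed)
    usage_log = [0]  # flat list of every past ingredient usage, seeded with one ingredient
    next_id = 1
    recipes = []
    for _ in range(n_recipes):
        recipe = set()
        while len(recipe) < recipe_size:
            if rng.random() < p_new:
                ingredient = next_id  # innovation: introduce a new ingredient
                next_id += 1
            else:
                ingredient = rng.choice(usage_log)  # preferential reuse
            recipe.add(ingredient)  # constraint: no ingredient twice in a recipe
        ordered = sorted(recipe)
        usage_log.extend(ordered)
        recipes.append(ordered)
    return recipes


def rank_frequency(recipes):
    """Ingredient usage counts sorted from most to least frequent."""
    counts = Counter(ing for recipe in recipes for ing in recipe)
    return sorted(counts.values(), reverse=True)
```

Plotting `rank_frequency(simulate_recipes())` on log-log axes is one way to see whether such a process yields a heavy-tailed rank-frequency curve; the exponent it produces depends on the assumed `p_new`.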

If this is right

  • Ingredient frequencies will consistently display power-law scaling across any large collection of recipes.
  • Culinary diversity will increase sublinearly as the corpus of recipes expands.
  • The average information content of a recipe's constituent units will vary with the number of units, following a Menzerath-Altmann-type relation.
  • Macronutrient concentrations in recipes will follow a log-normal distribution.
  • Simple rules of reuse and modification can generate the complex patterns seen in global cuisines.
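For reference, the four regularities listed above are commonly written as follows. The symbols are conventional notation assumed here, not taken from the paper, and this review does not report the fitted values.

```latex
\begin{align}
  f(r) &\propto r^{-\alpha}
    && \text{Zipf: frequency } f \text{ of the ingredient at rank } r \\
  V(N) &\propto N^{\beta}, \quad 0 < \beta < 1
    && \text{Heaps: distinct ingredients } V \text{ at corpus size } N \\
  y(x) &= a\, x^{b}\, e^{c x}
    && \text{Menzerath-Altmann: mean constituent measure } y \text{ vs. number of constituents } x \\
  \ln X &\sim \mathcal{N}(\mu, \sigma^{2})
    && \text{log-normal: macronutrient concentration } X
\end{align}
```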

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same generative processes might apply to other creative domains such as music composition or fashion design.
  • These laws could guide the development of algorithms for creating new recipes that maintain cultural authenticity while introducing novelty.
  • Understanding these constraints may help in studying how traditions evolve under pressures of availability and preference.

Load-bearing premise

The automatic annotation of recipes into ingredients and other attributes using named entity recognition is accurate and free from systematic biases across different cuisines and languages.
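One way to probe this premise is to score the NER output against a hand-annotated sample, cuisine by cuisine. The sketch below assumes exact-match comparison of entity sets and a hypothetical input format of `(cuisine, gold_entities, predicted_entities)` triples; it does not reflect the paper's actual evaluation protocol.

```python
def per_cuisine_scores(samples):
    """Micro-averaged precision, recall and F1 per cuisine.

    samples: iterable of (cuisine, gold_entities, predicted_entities),
    a hypothetical format assumed for this sketch. Entities are compared
    by exact string match within each recipe.
    """
    totals = {}  # cuisine -> [true positives, predicted count, gold count]
    for cuisine, gold, predicted in samples:
        gold, predicted = set(gold), set(predicted)
        t = totals.setdefault(cuisine, [0, 0, 0])
        t[0] += len(gold & predicted)
        t[1] += len(predicted)
        t[2] += len(gold)

    scores = {}
    for cuisine, (tp, n_pred, n_gold) in totals.items():
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cuisine] = (precision, recall, f1)
    return scores
```

A large gap in recall between cuisines would be the kind of systematic bias that could distort the tail of the ingredient distribution.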

What would settle it

A new, independently annotated corpus of recipes showing ingredient rank-frequency plots that deviate from a straight line on log-log scales, or generative models that cannot reproduce the observed Heaps' law or Menzerath-Altmann relations.
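A rough version of this test can be sketched as follows: estimate the Zipf exponent from the log-log rank-frequency curve and the Heaps exponent from the vocabulary-growth curve. The function names are hypothetical, and ordinary least squares on log-log axes is used only for brevity; maximum-likelihood fitting in the style of Clauset, Shalizi and Newman is the more careful choice for power-law claims.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx  # needs at least two distinct x values


def zipf_exponent(recipes):
    """Exponent alpha in f(r) ~ r^(-alpha), from ingredient counts."""
    counts = Counter(ing for recipe in recipes for ing in recipe)
    freqs = sorted(counts.values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -loglog_slope(ranks, freqs)


def heaps_exponent(recipes):
    """Exponent beta in V(N) ~ N^beta, sampling V after each recipe.

    N is counted in ingredient mentions here, which is an assumption;
    the result depends on the order in which recipes are read.
    """
    seen, sizes, vocab = set(), [], []
    n_mentions = 0
    for recipe in recipes:
        n_mentions += len(recipe)
        seen.update(recipe)
        sizes.append(n_mentions)
        vocab.append(len(seen))
    return loglog_slope(sizes, vocab)
```

A Heaps exponent close to 1, or strong curvature in the rank-frequency residuals, would count against the claimed regularities.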

Original abstract

Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much like languages do through words and syntax. Yet, beneath the apparent diversity of culinary traditions, whether recipes obey statistical laws comparable to those of other symbolic systems remains unknown. Here we analyze a large corpus of traditional recipes spanning global cuisines, annotated using a state-of-the-art named entity recognition algorithm into ingredients, cooking techniques, utensils, and other culinary attributes. We find that ingredient usage exhibits Zipf-like rank-frequency scaling, that culinary diversity grows sublinearly with corpus size in accordance with Heaps' law, and that recipe complexity follows Menzerath-Altmann-type relations between the number and average information of constituent units. Consistent with observations in packaged foods, macronutrient concentrations across recipes also display a log-normal signature. Minimal generative models based on preferential reuse, constrained sampling, and incremental modification recapitulate these regularities, suggesting generic processes that shape recipe architecture across cultures. Together, these findings establish recipes as a compositional symbolic system in which complex structure emerges from simple, constrained generative processes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 3 minor

Summary. The manuscript analyzes a large corpus of traditional recipes from global cuisines, using a state-of-the-art named entity recognition pipeline to annotate ingredients, cooking techniques, utensils, and other attributes. It reports that ingredient usage follows Zipf-like rank-frequency scaling, culinary diversity grows sublinearly with corpus size consistent with Heaps' law, recipe complexity obeys Menzerath-Altmann-type relations between the number and average information content of constituent units, and macronutrient concentrations exhibit log-normal distributions. Minimal generative models based on preferential reuse, constrained sampling, and incremental modification are shown to recapitulate these statistical regularities, suggesting that generic constrained processes shape recipe architecture across cultures.

Significance. If the empirical scalings and the independence of the generative models are robustly established, the work would provide evidence that recipes constitute a compositional symbolic system governed by universal statistical laws analogous to those observed in language, music, and other cultural artifacts. The strength lies in the cross-cultural scope and the attempt to link observations to minimal mechanistic models; successful validation would support the broader claim that complex cultural structures can emerge from simple generative rules without requiring culture-specific explanations. The paper does not include machine-checked proofs or fully reproducible code, but the falsifiable nature of the scaling predictions offers a clear path for future tests.

major comments (4)
  1. [§2.2] (Data and Methods - NER Pipeline): The manuscript relies entirely on a single state-of-the-art NER algorithm for extracting ingredients, techniques, and attributes, yet provides no quantitative validation metrics (precision, recall, F1 scores), no per-cuisine performance breakdowns, and no ablation comparing the pipeline to rule-based lists or human re-annotation. Because every reported scaling (Zipf rank-frequency, Heaps' diversity growth, Menzerath-Altmann relations) is computed from the NER output, systematic biases (such as under-detection of rare non-Western ingredients or inconsistent segmentation of compound names) could artifactually produce the claimed power-law and sublinear behaviors even if the underlying recipes lack these regularities.
  2. [§4.1–4.3] (Generative Models): The minimal models are stated to recapitulate the observed regularities, but the text does not clarify whether the free parameters (reuse probability in the preferential-attachment component and sampling-constraint size) were derived independently from theoretical considerations or fitted to the same empirical frequency and diversity curves. If the parameters were tuned to the data, the reproduction is tautological rather than an independent test of the proposed mechanisms; the manuscript must either derive the parameters a priori or demonstrate that the same parameter values emerge from multiple disjoint data subsets.
  3. [§3.1] (Results - Zipf and Heaps' Scaling): The rank-frequency plots and type-token curves are presented without reported fit statistics (exponent values with standard errors, R² or Kolmogorov-Smirnov statistics, number of recipes per cuisine, or bootstrap confidence intervals). In the absence of these quantities it is impossible to judge whether the claimed Zipf-like and sublinear behaviors are statistically significant, robust to corpus subsampling, or driven by a few high-frequency cuisines.
  4. [§3.3] (Menzerath-Altmann Relations): The claimed relations between the number of constituent units and their average information content are shown graphically but without an explicit functional form, goodness-of-fit measures, or controls for confounding variables such as recipe length or cuisine type. This weakens the assertion that the observed pattern is a genuine Menzerath-Altmann law rather than a generic consequence of length heterogeneity.
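The bootstrap confidence intervals requested in comment 3 could be produced along the following lines. This is a sketch under assumptions: recipes are resampled with replacement as whole units, a percentile interval is used, and `distinct_ingredients` is only a placeholder statistic; an exponent estimator would be passed in its place.

```python
import random


def distinct_ingredients(recipes):
    """Placeholder statistic: number of distinct ingredients in a corpus."""
    return len({ing for recipe in recipes for ing in recipe})


def bootstrap_ci(recipes, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(recipes).

    Recipes are resampled with replacement as whole units, so the
    co-occurrence of ingredients within a recipe is preserved.
    """
    rng = random.Random(seed)
    n = len(recipes)
    values = []
    for _ in range(n_resamples):
        resample = [recipes[rng.randrange(n)] for _ in range(n)]
        values.append(statistic(resample))
    values.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return values[lo_idx], values[hi_idx]
```

Running the same routine separately per cuisine would also show whether a handful of large cuisines dominate the pooled estimate.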
minor comments (3)
  1. [Abstract] The abstract refers to a 'state-of-the-art named entity recognition algorithm' without citing the specific model, training corpus, or reference paper.
  2. [Figures 1–3] Figure captions for the scaling plots should explicitly state the number of recipes, the fitting procedure, and any exclusion criteria applied to the data.
  3. [§3.4] The discussion of log-normal macronutrient distributions would benefit from a direct comparison to the packaged-food literature cited in the text, including quantitative parameter values.
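For the quantitative comparison suggested in minor comment 3, the log-normal parameters and a simple goodness-of-fit distance can be computed as sketched below. The function name is hypothetical, and the Kolmogorov-Smirnov distance is measured against a normal distribution fitted to the same log values, so standard KS critical values do not apply directly.

```python
import math
from statistics import NormalDist, fmean, pstdev


def lognormal_summary(values):
    """Return (mu, sigma, ks_distance) for positive values.

    mu and sigma are the mean and standard deviation of ln(value);
    ks_distance is the largest gap between the empirical CDF of the logs
    and the fitted normal CDF. Requires at least two distinct values.
    """
    logs = sorted(math.log(v) for v in values if v > 0)
    mu, sigma = fmean(logs), pstdev(logs)
    fitted = NormalDist(mu, sigma)
    n = len(logs)
    distance = 0.0
    for i, x in enumerate(logs):
        c = fitted.cdf(x)
        # compare against the empirical CDF just before and just after x
        distance = max(distance, abs(c - i / n), abs((i + 1) / n - c))
    return mu, sigma, distance
```

The resulting `mu` and `sigma` per macronutrient are the quantities that could be set side by side with the packaged-food values cited in the paper.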

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We believe the comments will significantly improve the clarity and robustness of our findings. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to implement.

Point-by-point responses
  1. Referee: [§2.2] (Data and Methods - NER Pipeline): major comment 1 above, on the missing validation of the NER pipeline.

    Authors: We agree that a detailed validation of the NER pipeline is crucial for the credibility of our results. Although the pipeline is based on a published state-of-the-art model, we did not include its performance metrics in the original submission. In the revised version, we will report the precision, recall, and F1 scores from the model's original evaluation, supplemented by our own manual validation on a random sample of 500 recipes stratified by cuisine. We will also include per-cuisine breakdowns and an ablation study against a rule-based ingredient list derived from common culinary databases. These additions will allow readers to assess potential biases in the extraction process. revision: yes

  2. Referee: [§4.1–4.3] (Generative Models): major comment 2 above, on whether the model parameters were fitted to the same data.

    Authors: The parameters were selected based on values commonly used in analogous preferential attachment models from network science and linguistics, without direct fitting to our culinary data. To strengthen this, we will revise the manuscript to explicitly state the theoretical motivation for each parameter and demonstrate that the same parameter set reproduces the observed scalings when applied to multiple disjoint subsets of the recipe corpus (e.g., by cuisine or by random splits). This will confirm the independence of the generative process from the specific dataset. revision: yes

  3. Referee: [§3.1] (Results - Zipf and Heaps' Scaling): major comment 3 above, on the missing fit statistics.

    Authors: We will enhance the results section by including quantitative fit statistics for all scaling relations. Specifically, we will report the Zipf exponent with standard errors, R² values, Kolmogorov-Smirnov statistics for goodness-of-fit, the number of recipes per cuisine, and bootstrap confidence intervals obtained from 1000 resamples. Additionally, we will show robustness by presenting scaling exponents for subsampled corpora and for individual high-frequency cuisines separately. revision: yes

  4. Referee: [§3.3] (Menzerath-Altmann Relations): major comment 4 above, on the missing functional form and controls.

    Authors: We will revise this section to include the explicit Menzerath-Altmann functional form, typically y = a * x^b * exp(c * x), or the power-law variant y = a * x^b obtained when c = 0, fitted to the data, along with associated goodness-of-fit metrics such as R² and residual analysis. We will also add controls by regressing out recipe length and including cuisine as a covariate in the analysis to demonstrate that the relation persists independently of these factors. revision: yes
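The functional form named in the last response, y = a * x^b * exp(c * x), becomes linear after taking logarithms: ln y = ln a + b ln x + c x. A minimal sketch of the corresponding least-squares fit follows; the helper names are hypothetical, and fitting in log space (which weights relative rather than absolute errors) is an assumption, not necessarily what the authors will do.

```python
import math


def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_menzerath_altmann(xs, ys):
    """Fit ln y = ln a + b ln x + c x by least squares; returns (a, b, c).

    xs and ys must be positive, with at least three distinct x values.
    """
    rows = [(1.0, math.log(x), float(x)) for x in xs]
    targets = [math.log(y) for y in ys]
    # Normal equations: (X^T X) beta = X^T t
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    ln_a, b, c = solve_linear(xtx, xty)
    return math.exp(ln_a), b, c
```

Comparing this three-parameter fit against the c = 0 power-law variant is one way to check whether the exponential term is actually needed.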

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper reports empirical observations of Zipf-like scaling, Heaps' law, Menzerath-Altmann relations, and log-normal macronutrient distributions directly from NER-annotated recipe data. It then introduces minimal generative models based on independent principles (preferential reuse, constrained sampling, incremental modification) that are said to recapitulate the observed regularities. No equations, parameter-fitting descriptions, or self-citations in the abstract or context reduce any claimed result to a definitional tautology or a fitted reproduction of the same input statistics. The modeling step is presented as explanatory rather than self-referential, with no load-bearing uniqueness theorems or ansatzes imported from prior self-work. The chain is self-contained as data-driven discovery followed by mechanistic simulation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that automated NER produces unbiased culinary annotations and that the chosen corpus represents global cuisines without selection bias. The generative models introduce free parameters that are not detailed in the abstract: the reuse probability and sampling-constraint size listed below, and presumably a modification rate for the incremental-modification step, which the ledger does not count separately. No new physical entities are postulated.

free parameters (2)
  • reuse probability in preferential attachment model
    Required to generate Zipf-like frequencies; value not stated in abstract.
  • sampling constraint size
    Limits ingredient choices per recipe to produce Heaps' law; value not stated.
axioms (2)
  • domain assumption: The named entity recognition algorithm correctly identifies culinary entities across cultures
    Invoked when the corpus is annotated into ingredients, techniques, and utensils.
  • domain assumption: The collected recipes form a representative sample of traditional global cuisines
    Required for the claim that the observed laws are universal rather than corpus-specific.

pith-pipeline@v0.9.0 · 5518 in / 1736 out tokens · 68051 ms · 2026-05-07T04:52:40.172144+00:00 · methodology

