Universal statistical laws governing culinary design
Pith reviewed 2026-05-07 04:52 UTC · model grok-4.3
The pith
Recipes follow Zipf-like scaling and other statistical patterns also found in languages, and these patterns emerge from simple rules of reuse and modification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that recipes form a compositional symbolic system governed by universal statistical laws, including Zipf-like rank-frequency scaling for ingredients, sublinear growth of culinary diversity per Heaps' law, Menzerath-Altmann-type relations for complexity, and log-normal distributions for macronutrients. These patterns arise from generic generative processes involving preferential reuse, constrained sampling, and incremental modification, which suffice to recapitulate the observed structures across cultures without needing culture-specific rules.
What carries the argument
Minimal generative models based on preferential reuse of ingredients, constrained sampling, and incremental modification of recipes, which generate the observed statistical regularities.
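To make the mechanism concrete, here is a minimal sketch of a preferential-reuse generator. It is not the paper's model: the function name `generate_recipes`, the parameter `p_new`, and the fixed recipe size are assumptions chosen for illustration, and the incremental-modification step is left out. Each ingredient slot is either a brand-new ingredient or a reuse drawn in proportion to past usage, with no repeats allowed inside one recipe.

```python
import random
from collections import Counter


def generate_recipes(n_recipes, recipe_size, p_new, seed=0):
    """Toy generator (hypothetical, not the paper's model).

    p_new is the assumed probability that a slot introduces a new ingredient;
    otherwise an ingredient is reused in proportion to its past usage.
    """
    if not 0 < p_new <= 1:
        raise ValueError("p_new must be in (0, 1]")
    rng = random.Random(seed)
    usage = []      # one entry per past use, so a uniform draw is preferential reuse
    next_id = 0     # ingredients are labelled by integers
    recipes = []
    for _ in range(n_recipes):
        recipe = set()
        while len(recipe) < recipe_size:
            if not usage or rng.random() < p_new:
                ingredient = next_id
                next_id += 1
            else:
                ingredient = rng.choice(usage)
                if ingredient in recipe:
                    continue  # constrained sampling: no repeats within a recipe
            recipe.add(ingredient)
            usage.append(ingredient)
        recipes.append(sorted(recipe))
    return recipes


recipes = generate_recipes(n_recipes=2000, recipe_size=8, p_new=0.02, seed=1)
counts = Counter(ing for recipe in recipes for ing in recipe)
print(len(counts), counts.most_common(3))
```

A few early ingredients end up in many recipes while most late arrivals stay rare, which is the kind of skew a rank-frequency analysis would then quantify.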
If this is right
- Ingredient rank-frequency curves will show Zipf-like power-law scaling in other large recipe collections, not only in this corpus.
- Culinary diversity will increase sublinearly as the corpus of recipes expands.
- The number of constituent units in a recipe and their average information content will follow Menzerath-Altmann-type relations.
- Macronutrient concentrations in recipes will follow a log-normal distribution.
- Simple rules of reuse and modification can generate the complex patterns seen in global cuisines.
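The sublinear-growth prediction can be checked with a short script. The sketch below assumes a corpus represented as a list of ingredient lists; the helper names `vocabulary_growth`, `loglog_slope`, and `heaps_exponent` are illustrative, and an ordinary least-squares slope on log-log axes is only a rough estimate of the Heaps exponent.

```python
import math


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def vocabulary_growth(recipes):
    """Return (tokens seen, distinct ingredients seen) after each recipe."""
    seen = set()
    tokens = 0
    curve = []
    for recipe in recipes:
        tokens += len(recipe)
        seen.update(recipe)
        curve.append((tokens, len(seen)))
    return curve


def heaps_exponent(recipes):
    """Slope of log(vocabulary) vs log(tokens); below 1 means sublinear growth."""
    curve = vocabulary_growth(recipes)
    return loglog_slope([t for t, _ in curve], [v for _, v in curve])


# Toy corpus (made up for illustration): token n carries ingredient isqrt(n),
# so the vocabulary grows roughly as the square root of the token count.
toy = [[math.isqrt(n)] for n in range(1, 10001)]
print(round(heaps_exponent(toy), 2))
```

An exponent close to 1 would mean new ingredients keep arriving at a constant rate, so values clearly below 1 are what the claim requires.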
Where Pith is reading between the lines
- The same generative processes might apply to other creative domains such as music composition or fashion design.
- These laws could guide the development of algorithms for creating new recipes that maintain cultural authenticity while introducing novelty.
- Understanding these constraints may help in studying how traditions evolve under pressures of availability and preference.
Load-bearing premise
The automatic annotation of recipes into ingredients and other attributes using named entity recognition is accurate and free from systematic biases across different cuisines and languages.
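One way to probe this premise is to score the automatic annotations against a hand-labelled sample, overall and per cuisine. The sketch below assumes annotations are stored as `(recipe_id, text, label)` tuples and that a `cuisine_of` mapping from recipe id to cuisine is available; these structures and the function names are hypothetical, not taken from the paper.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 for two collections of annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_cuisine_prf1(predicted, gold, cuisine_of):
    """Score each cuisine separately; item[0] is assumed to be the recipe id."""
    cuisines = {cuisine_of[item[0]] for item in set(predicted) | set(gold)}
    return {
        cuisine: prf1(
            [p for p in predicted if cuisine_of[p[0]] == cuisine],
            [g for g in gold if cuisine_of[g[0]] == cuisine],
        )
        for cuisine in sorted(cuisines)
    }


# Made-up example annotations, for illustration only.
predicted = [("r1", "salt", "ING"), ("r1", "fry", "TECH"), ("r2", "pan", "UTENSIL")]
gold = [("r1", "salt", "ING"), ("r2", "pan", "UTENSIL"),
        ("r2", "cumin", "ING"), ("r2", "stir", "TECH")]
cuisine_of = {"r1": "Italian", "r2": "Indian"}
print(prf1(predicted, gold))
print(per_cuisine_prf1(predicted, gold, cuisine_of))
```

Large gaps in recall between cuisines would be a warning sign that the tail of the rank-frequency curve is shaped by the annotator rather than by the recipes.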
What would settle it
A new, independently annotated corpus of recipes showing ingredient rank-frequency plots that deviate from a straight line on log-log scales, or generative models that cannot reproduce the observed Heaps' law or Menzerath-Altmann relations.
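A rough version of the straight-line test can be sketched as a least-squares fit of log frequency against log rank, with R² as a measure of how far the curve departs from a line. The function names are illustrative. This is a crude diagnostic under stated assumptions: maximum-likelihood fitting with a goodness-of-fit test is generally preferred for power-law claims.

```python
import math
from collections import Counter


def ingredient_counts(recipes):
    """Count how many times each ingredient occurs across all recipes."""
    return Counter(ing for recipe in recipes for ing in recipe)


def fit_rank_frequency(frequencies):
    """Fit log(f) = intercept + slope * log(rank); return (slope, r_squared)."""
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all frequencies are equal; R^2 is undefined")
    return slope, 1 - ss_res / ss_tot


# Synthetic frequencies, for illustration only.
power_law = [1000.0 / rank for rank in range(1, 101)]           # straight on log-log
exponential = [1000.0 * 0.9 ** rank for rank in range(1, 101)]  # curved on log-log
print(fit_rank_frequency(power_law))
print(fit_rank_frequency(exponential))
```

For a real corpus, `fit_rank_frequency(ingredient_counts(recipes).values())` would give the same two numbers; a low R² on an independently annotated corpus would count against the Zipf-like claim.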
Original abstract
Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much like languages do through words and syntax. Yet, beneath the apparent diversity of culinary traditions, whether recipes obey statistical laws comparable to those of other symbolic systems remains unknown. Here we analyze a large corpus of traditional recipes spanning global cuisines, annotated using a state-of-the-art named entity recognition algorithm into ingredients, cooking techniques, utensils, and other culinary attributes. We find that ingredient usage exhibits Zipf-like rank-frequency scaling, that culinary diversity grows sublinearly with corpus size in accordance with Heaps' law, and that recipe complexity follows Menzerath-Altmann-type relations between the number and average information of constituent units. Consistent with observations in packaged foods, macronutrient concentrations across recipes also display a log-normal signature. Minimal generative models based on preferential reuse, constrained sampling, and incremental modification recapitulate these regularities, suggesting generic processes that shape recipe architecture across cultures. Together, these findings establish recipes as a compositional symbolic system in which complex structure emerges from simple, constrained generative processes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes a large corpus of traditional recipes from global cuisines, using a state-of-the-art named entity recognition pipeline to annotate ingredients, cooking techniques, utensils, and other attributes. It reports that ingredient usage follows Zipf-like rank-frequency scaling, culinary diversity grows sublinearly with corpus size consistent with Heaps' law, recipe complexity obeys Menzerath-Altmann-type relations between the number and average information content of constituent units, and macronutrient concentrations exhibit log-normal distributions. Minimal generative models based on preferential reuse, constrained sampling, and incremental modification are shown to recapitulate these statistical regularities, suggesting that generic constrained processes shape recipe architecture across cultures.
Significance. If the empirical scalings and the independence of the generative models are robustly established, the work would provide evidence that recipes constitute a compositional symbolic system governed by universal statistical laws analogous to those observed in language, music, and other cultural artifacts. The strength lies in the cross-cultural scope and the attempt to link observations to minimal mechanistic models; successful validation would support the broader claim that complex cultural structures can emerge from simple generative rules without requiring culture-specific explanations. The paper does not include machine-checked proofs or fully reproducible code, but the falsifiable nature of the scaling predictions offers a clear path for future tests.
major comments (4)
- [§2.2] §2.2 (Data and Methods - NER Pipeline): The manuscript relies entirely on a single state-of-the-art NER algorithm for extracting ingredients, techniques, and attributes, yet provides no quantitative validation metrics (precision, recall, F1 scores), no per-cuisine performance breakdowns, and no ablation comparing the pipeline to rule-based lists or human re-annotation. Because every reported scaling (Zipf rank-frequency, Heaps' diversity growth, Menzerath-Altmann relations) is computed from the NER output, systematic biases—such as under-detection of rare non-Western ingredients or inconsistent segmentation of compound names—could artifactually produce the claimed power-law and sublinear behaviors even if the underlying recipes lack these regularities.
- [§4.1–4.3] §4.1–4.3 (Generative Models): The minimal models are stated to recapitulate the observed regularities, but the text does not clarify whether the free parameters (reuse probability in the preferential-attachment component and sampling-constraint size) were derived independently from theoretical considerations or fitted to the same empirical frequency and diversity curves. If the parameters were tuned to the data, the reproduction is tautological rather than an independent test of the proposed mechanisms; the manuscript must either derive the parameters a priori or demonstrate that the same parameter values emerge from multiple disjoint data subsets.
- [§3.1] §3.1 (Results - Zipf and Heaps' Scaling): The rank-frequency plots and type-token curves are presented without reported fit statistics (exponent values with standard errors, R² or Kolmogorov-Smirnov statistics, number of recipes per cuisine, or bootstrap confidence intervals). In the absence of these quantities it is impossible to judge whether the claimed Zipf-like and sublinear behaviors are statistically significant, robust to corpus subsampling, or driven by a few high-frequency cuisines.
- [§3.3] §3.3 (Menzerath-Altmann Relations): The claimed relations between the number of constituent units and their average information content are shown graphically but without an explicit functional form, goodness-of-fit measures, or controls for confounding variables such as recipe length or cuisine type. This weakens the assertion that the observed pattern is a genuine Menzerath-Altmann law rather than a generic consequence of length heterogeneity.
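The bootstrap confidence intervals requested in the §3.1 comment could be obtained along the following lines. This is a sketch of a percentile bootstrap that resamples whole recipes with replacement; the function names, the toy corpus, and the use of a least-squares Zipf slope as the statistic are assumptions for illustration, not the authors' procedure.

```python
import math
import random
from collections import Counter


def zipf_slope(recipes):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = Counter(ing for recipe in recipes for ing in recipe)
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct ingredients")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def bootstrap_ci(recipes, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling recipes with replacement."""
    rng = random.Random(seed)
    n = len(recipes)
    estimates = sorted(
        statistic([recipes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return estimates[lo_idx], estimates[hi_idx]


# Toy corpus (made up): ingredient r appears in roughly 200 / (r + 1) recipes.
toy = [[0] + [r for r in range(1, 10) if j % (r + 1) == 0] for j in range(200)]
low, high = bootstrap_ci(toy, zipf_slope, n_resamples=1000)
print(low, high)
```

Resampling recipes rather than individual ingredient tokens keeps co-occurrence within a recipe intact, which seems the more natural unit here; running the same function on each cuisine separately would show whether a few large cuisines drive the pooled exponent.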
minor comments (3)
- [Abstract] The abstract refers to a 'state-of-the-art named entity recognition algorithm' without citing the specific model, training corpus, or reference paper.
- [Figures 1–3] Figure captions for the scaling plots should explicitly state the number of recipes, the fitting procedure, and any exclusion criteria applied to the data.
- [§3.4] The discussion of log-normal macronutrient distributions would benefit from a direct comparison to the packaged-food literature cited in the text, including quantitative parameter values.
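For the quantitative comparison suggested in the §3.4 comment, the log-normal parameters can be estimated directly from the log-transformed concentrations. The sketch below assumes a plain list of positive concentration values and silently drops zeros, which is itself a modelling choice; the function names and the synthetic sample are illustrative only.

```python
import math
import random
import statistics


def lognormal_fit(values):
    """Maximum-likelihood log-normal parameters: mean and std dev of log(values).

    Non-positive values are dropped (an assumption; zeros cannot be log-normal).
    """
    logs = [math.log(v) for v in values if v > 0]
    return statistics.fmean(logs), statistics.pstdev(logs)


def log_skewness(values):
    """Skewness of log(values); close to 0 if the data are log-normal."""
    logs = [math.log(v) for v in values if v > 0]
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)
    return sum(((x - mean) / std) ** 3 for x in logs) / len(logs)


# Synthetic sample with known parameters, for illustration only.
rng = random.Random(0)
sample = [rng.lognormvariate(1.0, 0.5) for _ in range(20000)]
mu, sigma = lognormal_fit(sample)
print(round(mu, 2), round(sigma, 2), round(log_skewness(sample), 2))
```

Reporting `mu` and `sigma` per macronutrient would make a side-by-side comparison with packaged-food values straightforward, and a clearly non-zero log skewness would argue against the log-normal description.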
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We believe the comments will significantly improve the clarity and robustness of our findings. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to implement.
- Referee: [§2.2] §2.2 (Data and Methods - NER Pipeline): The manuscript relies entirely on a single state-of-the-art NER algorithm for extracting ingredients, techniques, and attributes, yet provides no quantitative validation metrics (precision, recall, F1 scores), no per-cuisine performance breakdowns, and no ablation comparing the pipeline to rule-based lists or human re-annotation. Because every reported scaling (Zipf rank-frequency, Heaps' diversity growth, Menzerath-Altmann relations) is computed from the NER output, systematic biases, such as under-detection of rare non-Western ingredients or inconsistent segmentation of compound names, could artifactually produce the claimed power-law and sublinear behaviors even if the underlying recipes lack these regularities.
  Authors: We agree that a detailed validation of the NER pipeline is crucial for the credibility of our results. Although the pipeline is based on a published state-of-the-art model, we did not include its performance metrics in the original submission. In the revised version, we will report the precision, recall, and F1 scores from the model's original evaluation, supplemented by our own manual validation on a random sample of 500 recipes stratified by cuisine. We will also include per-cuisine breakdowns and an ablation study against a rule-based ingredient list derived from common culinary databases. These additions will allow readers to assess potential biases in the extraction process.
  Revision: yes
- Referee: [§4.1–4.3] §4.1–4.3 (Generative Models): The minimal models are stated to recapitulate the observed regularities, but the text does not clarify whether the free parameters (reuse probability in the preferential-attachment component and sampling-constraint size) were derived independently from theoretical considerations or fitted to the same empirical frequency and diversity curves. If the parameters were tuned to the data, the reproduction is tautological rather than an independent test of the proposed mechanisms; the manuscript must either derive the parameters a priori or demonstrate that the same parameter values emerge from multiple disjoint data subsets.
  Authors: The parameters were selected based on values commonly used in analogous preferential attachment models from network science and linguistics, without direct fitting to our culinary data. To strengthen this, we will revise the manuscript to explicitly state the theoretical motivation for each parameter and demonstrate that the same parameter set reproduces the observed scalings when applied to multiple disjoint subsets of the recipe corpus (e.g., by cuisine or by random splits). This will confirm the independence of the generative process from the specific dataset.
  Revision: yes
- Referee: [§3.1] §3.1 (Results - Zipf and Heaps' Scaling): The rank-frequency plots and type-token curves are presented without reported fit statistics (exponent values with standard errors, R² or Kolmogorov-Smirnov statistics, number of recipes per cuisine, or bootstrap confidence intervals). In the absence of these quantities it is impossible to judge whether the claimed Zipf-like and sublinear behaviors are statistically significant, robust to corpus subsampling, or driven by a few high-frequency cuisines.
  Authors: We will enhance the results section by including quantitative fit statistics for all scaling relations. Specifically, we will report the Zipf exponent with standard errors, R² values, Kolmogorov-Smirnov statistics for goodness-of-fit, the number of recipes per cuisine, and bootstrap confidence intervals obtained from 1000 resamples. Additionally, we will show robustness by presenting scaling exponents for subsampled corpora and for individual high-frequency cuisines separately.
  Revision: yes
- Referee: [§3.3] §3.3 (Menzerath-Altmann Relations): The claimed relations between the number of constituent units and their average information content are shown graphically but without an explicit functional form, goodness-of-fit measures, or controls for confounding variables such as recipe length or cuisine type. This weakens the assertion that the observed pattern is a genuine Menzerath-Altmann law rather than a generic consequence of length heterogeneity.
  Authors: We will revise this section to include the explicit Menzerath-Altmann functional form (typically of the form y = a * x^b * exp(c * x) or the standard power-law variant) fitted to the data, along with associated goodness-of-fit metrics such as R² and residual analysis. We will also add controls by regressing out recipe length and including cuisine as a covariate in the analysis to demonstrate that the relation persists independently of these factors.
  Revision: yes
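The functional form named in the last response, y = a * x^b * exp(c * x), becomes linear in its parameters after taking logs: log y = log a + b * log x + c * x. A minimal sketch of the resulting least-squares fit follows; the helper names and the synthetic data are assumptions for illustration, and the covariate controls the authors mention are not included.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: need more distinct x values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_menzerath_altmann(xs, ys):
    """Fit y = a * x**b * exp(c * x) by least squares on log(y); return (a, b, c)."""
    rows = [(1.0, math.log(x), float(x)) for x in xs]
    targets = [math.log(y) for y in ys]
    # Normal equations: (X^T X) beta = X^T t, with beta = (log a, b, c).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    log_a, b, c = solve_linear(xtx, xty)
    return math.exp(log_a), b, c


# Synthetic data generated from known parameters, for illustration only.
xs = list(range(1, 21))
ys = [2.0 * x ** -0.3 * math.exp(-0.05 * x) for x in xs]
print(fit_menzerath_altmann(xs, ys))
```

Fitting in log space weights relative rather than absolute errors, so a nonlinear fit on the original scale could give somewhat different parameters on noisy data.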
Circularity Check
No significant circularity in the derivation chain
The paper reports empirical observations of Zipf-like scaling, Heaps' law, Menzerath-Altmann relations, and log-normal macronutrient distributions directly from NER-annotated recipe data. It then introduces minimal generative models based on independent principles (preferential reuse, constrained sampling, incremental modification) that are said to recapitulate the observed regularities. No equations, parameter-fitting descriptions, or self-citations in the abstract or context reduce any claimed result to a definitional tautology or a fitted reproduction of the same input statistics. The modeling step is presented as explanatory rather than self-referential, with no load-bearing uniqueness theorems or ansatzes imported from prior self-work. The chain is self-contained as data-driven discovery followed by mechanistic simulation.
Axiom & Free-Parameter Ledger
free parameters (2)
- reuse probability in preferential attachment model
- sampling constraint size
axioms (2)
- domain assumption: The named entity recognition algorithm correctly identifies culinary entities across cultures.
- domain assumption: The collected recipes form a representative sample of traditional global cuisines.