Learn your entropy from informative data: an axiom ensuring the consistent identification of generalized entropies

Andrea Somazzi; Diego Garlaschelli

arXiv: 2301.05660 · v1 · submitted 2023-01-13 · ⚛️ physics.data-an · math.ST · stat.ME · stat.TH

Pith reviewed 2026-05-24 10:20 UTC · model grok-4.3

classification ⚛️ physics.data-an · math.ST · stat.ME · stat.TH

keywords generalized entropy · Rényi entropy · maximum likelihood estimation · Shannon entropy · information axiom · uniform distribution · non-extensive statistics · statistical inference

The pith

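A new axiom treating uniform distributions as uninformative about entropic parameters selects only Rényi entropy within the families considered, and ensures that the maximized log-likelihood of the generalized maximum-likelihood procedure always equals minus the Shannon entropy.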

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an axiom stating that no entropic parameter can be inferred from a completely uninformative uniform probability distribution. When this requirement is imposed on the Uffink-Jizba-Korbel and Hanel-Thurner families, only Rényi entropy remains viable. The axiom permits direct estimation of the entropic parameter from data alone through a generalized maximum-likelihood procedure. It further guarantees that the maximized log-likelihood equals minus the Shannon entropy of the inferred distribution, even when that distribution was obtained by maximizing a different entropy functional.

Core claim

The axiom that uniform distributions carry no information about entropic parameters selects Rényi entropy within the considered families and implies that, in any generalized maximum-entropy framework consistent with the axiom, the maximized log-likelihood is invariably equal to minus the Shannon entropy.

What carries the argument

The axiom that a uniform distribution is completely uninformative for the purpose of identifying entropic parameters.
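A minimal numerical sketch of what the axiom asks for, assuming the standard textbook definitions of Rényi and Tsallis entropy (the function names and the choice of 8 states are illustrative, not taken from the paper). On a uniform distribution the Rényi value stays at ln Ω for every q, while the Tsallis value changes with q:

```python
import numpy as np

def renyi_entropy(p, q):
    """Renyi entropy ln(sum_i p_i^q) / (1 - q), with the Shannon limit at q = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** q)) / (1.0 - q))

def tsallis_entropy(p, q):
    """Tsallis entropy (sum_i p_i^q - 1) / (1 - q), with the Shannon limit at q = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((np.sum(p ** q) - 1.0) / (1.0 - q))

omega = 8                                  # illustrative number of states
uniform = np.full(omega, 1.0 / omega)      # completely uninformative distribution
qs = [0.5, 1.0, 2.0, 3.0]

renyi_values = [renyi_entropy(uniform, q) for q in qs]
tsallis_values = [tsallis_entropy(uniform, q) for q in qs]

for q, r, t in zip(qs, renyi_values, tsallis_values):
    print(f"q={q}: Renyi={r:.4f}  Tsallis={t:.4f}")
```

Because the Rényi column does not move with q, the uniform distribution gives no handle for preferring one q over another, which is the behaviour the axiom demands.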

If this is right

Only Rényi entropy survives the axiom inside the Uffink-Jizba-Korbel and Hanel-Thurner families.
The entropic parameter can be estimated purely from informative data without external system knowledge.
Generalized maximum-entropy inference remains consistent with the classical maximum-likelihood principle.
The value of the maximized log-likelihood is always minus the Shannon entropy regardless of which generalized entropy is maximized.
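A short derivation sketch of why the last point can hold. It assumes the maximum-Rényi-entropy distribution has the q-exponential form p_q(G_i, ψ) = [1 − (1 − q) ψ·C(G_i)]^{1/(1−q)} / Z_q(ψ), which is the parametrization suggested by the paper's Eqs. (66) to (68) as quoted in the reference graph; the shorthand e_i is introduced here for convenience and is not the paper's notation.

```latex
\begin{align*}
e_i &\equiv 1-(1-q)\,\psi\cdot C(G_i), \qquad
\ln p_q(G_i,\psi) = \frac{\ln e_i}{1-q} - \ln Z_q(\psi), \\
\ell_q(\psi) &\equiv \frac{1}{M}\sum_{m=1}^{M}\ln p_q(G^*_m,\psi)
  \qquad (e_m \text{ denotes } e \text{ evaluated at } G^*_m), \\
\partial_\psi \ell_q = 0 &\;\Rightarrow\;
  \frac{1}{M}\sum_{m=1}^{M}\frac{C(G^*_m)}{e_m}
  = \sum_{i=1}^{\Omega} p_q(G_i,\psi)\,\frac{C(G_i)}{e_i}, \\
\partial_q \ell_q = 0 &\;\Rightarrow\;
  \frac{1}{M}\sum_{m=1}^{M}\ln e_m
  = \sum_{i=1}^{\Omega} p_q(G_i,\psi)\,\ln e_i
  \quad\text{(after using the previous line)}, \\
\ln e = (1-q)\left(\ln p_q + \ln Z_q\right) &\;\Rightarrow\;
  \frac{1}{M}\sum_{m=1}^{M}\ln p_q(G^*_m,\psi)
  = \sum_{i=1}^{\Omega} p_q(G_i,\psi)\,\ln p_q(G_i,\psi)
  = -S_1[P_q(\psi)].
\end{align*}
```

The two stationarity conditions together cancel every q-dependent prefactor, so what remains at the joint maximum is the Shannon functional even though the distribution itself maximizes Rényi entropy.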

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation between entropy choice and parameter estimation could be tested on empirical distributions from systems suspected to be non-ergodic.
The same axiom might be applied to other entropy families not examined in the paper to check whether additional candidates survive.
Numerical checks on synthetic data generated from known Rényi distributions would confirm that the recovered log-likelihood matches minus Shannon entropy.
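One way the last check could be sketched, under loudly stated assumptions: a finite state space with a hypothetical observable C(G_i) = i, synthetic data drawn from a q-exponential with q = 1.5 and ψ = 1, a search restricted to q > 1 and ψ > 0, and SciPy's Nelder-Mead as the optimizer. None of these choices come from the paper, and this is not the authors' numerical procedure.

```python
import numpy as np
from scipy.optimize import minimize

def q_exponential(q, psi, c):
    """Normalized weights [1 + (q - 1) * psi * c]^(-1 / (q - 1)); assumes q > 1, psi >= 0."""
    weights = (1.0 + (q - 1.0) * psi * c) ** (-1.0 / (q - 1.0))
    return weights / weights.sum()

rng = np.random.default_rng(0)
c = np.arange(200, dtype=float)                 # hypothetical observable C(G_i) = i
p_true = q_exponential(1.5, 1.0, c)             # assumed "true" parameters
samples = rng.choice(c.size, size=20000, p=p_true)
freq = np.bincount(samples, minlength=c.size) / samples.size

def neg_loglik(x):
    """Minus the per-observation log-likelihood, with a penalty outside q > 1, psi > 0."""
    q, psi = x
    if q <= 1.0 + 1e-6 or psi <= 0.0:
        return 1e10
    return -float(np.sum(freq * np.log(q_exponential(q, psi, c))))

res = minimize(neg_loglik, x0=[1.3, 0.5], method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12,
                        "maxiter": 5000, "maxfev": 10000})
q_star, psi_star = res.x
p_star = q_exponential(q_star, psi_star, c)

loglik = float(np.sum(freq * np.log(p_star)))             # maximized log-likelihood
shannon = float(-np.sum(p_star * np.log(p_star)))         # Shannon entropy of the fit
renyi = float(np.log(np.sum(p_star ** q_star)) / (1.0 - q_star))

print(f"q*={q_star:.3f}  psi*={psi_star:.3f}")
print(f"log-likelihood={loglik:.6f}  -Shannon={-shannon:.6f}  -Renyi={-renyi:.6f}")
```

If the identity holds, the first two printed quantities on the last line agree up to optimizer tolerance, while minus the Rényi entropy of the same fitted distribution does not.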

Load-bearing premise

The uniform distribution must be treated as carrying no information whatsoever about entropic parameters.

What would settle it

A dataset drawn from a uniform distribution on which the generalized maximum-likelihood procedure returns a non-default value for an entropic parameter would falsify the central claim.
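A small sketch of the null result this test would have to overturn, assuming the same hypothetical q-exponential family on a finite state space with C(G_i) = i and q > 1 (illustrative choices, not the paper's setup). With perfectly uniform empirical frequencies, the best ψ is 0 for every q and the profile log-likelihood sits at −ln Ω regardless of q:

```python
import numpy as np

def q_exponential(q, psi, c):
    """Normalized weights [1 + (q - 1) * psi * c]^(-1 / (q - 1)); assumes q > 1, psi >= 0."""
    weights = (1.0 + (q - 1.0) * psi * c) ** (-1.0 / (q - 1.0))
    return weights / weights.sum()

omega = 50
c = np.arange(omega, dtype=float)          # hypothetical observable C(G_i) = i
freq = np.full(omega, 1.0 / omega)         # perfectly uniform empirical frequencies

def loglik(q, psi):
    """Per-observation log-likelihood of the uniform data under the model."""
    return float(np.sum(freq * np.log(q_exponential(q, psi, c))))

qs = [1.2, 1.5, 2.0, 3.0]
psis = np.linspace(0.0, 2.0, 201)          # grid of candidate psi values, including 0

profile, best_psi = [], []
for q in qs:
    values = [loglik(q, psi) for psi in psis]
    k = int(np.argmax(values))
    profile.append(values[k])              # log-likelihood maximized over psi at fixed q
    best_psi.append(float(psis[k]))

for q, v, b in zip(qs, profile, best_psi):
    print(f"q={q}: max log-likelihood={v:.6f} at psi={b}")
```

A flat profile in q means the data single out no value of the entropic parameter; a falsifying dataset would have to break this flatness.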

Figures

Figures reproduced from arXiv: 2301.05660 by Andrea Somazzi, Diego Garlaschelli.

Original abstract

Shannon entropy, a cornerstone of information theory, statistical physics and inference methods, is uniquely identified by the Shannon-Khinchin or Shore-Johnson axioms. Generalizations of Shannon entropy, motivated by the study of non-extensive or non-ergodic systems, relax some of these axioms and lead to entropy families indexed by certain 'entropic' parameters. In general, the selection of these parameters requires pre-knowledge of the system or encounters inconsistencies. Here we introduce a simple axiom for any entropy family: namely, that no entropic parameter can be inferred from a completely uninformative (uniform) probability distribution. When applied to the Uffink-Jizba-Korbel and Hanel-Thurner entropies, the axiom selects only Rényi entropy as viable. It also extends consistency with the Maximum Likelihood principle, which can then be generalized to estimate the entropic parameter purely from data, as we confirm numerically. Remarkably, in a generalized maximum-entropy framework the axiom implies that the maximized log-likelihood always equals minus Shannon entropy, even if the inferred probability distribution maximizes a generalized entropy and not Shannon's, solving a series of problems encountered in previous approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The new axiom that no entropic parameter can be inferred from uniform data selects Rényi entropy from the cited families and forces maximized log-likelihood to equal minus Shannon entropy even under generalized maxent.

The main takeaway is the axiom that uniform distributions are completely uninformative for entropic parameters. Applied to the Uffink-Jizba-Korbel and Hanel-Thurner families, it leaves only Rényi entropy, and it produces the result that the maximized log-likelihood always equals minus ordinary Shannon entropy regardless of which generalized entropy is being maximized. That identity is the concrete payoff for inference work. The axiom and these two consequences are new; they are not reductions of the earlier results referenced in the abstract.

The paper does a clean job of turning the axiom into a selection rule and showing how it restores consistency with maximum likelihood, including a numerical check that the parameter can be estimated from data alone. The algebraic path from axiom to selection and to the likelihood identity is presented as direct, which is a strength if the steps hold up.

The soft spot is the axiom's premise itself. Whether uniform data must be treated as giving literally zero information about the parameter depends on how the inference rule from uniform data is formalized; a different parametrization or likelihood construction could in principle allow parameter recovery and break the selection. The abstract does not display the intermediate lemmas or explicit verification across the families, so the derivation needs direct inspection in the full text to confirm there are no hidden assumptions. The claim that this resolves inconsistencies in prior generalized maxent approaches is plausible but would be stronger with side-by-side comparisons.

This paper is aimed at people working on non-extensive statistics, generalized entropies, or maxent methods in physics and inference. A reader already interested in consistency conditions for entropy families will get a usable selection principle and the likelihood identity. It deserves a serious referee because the idea is original, the consequences are specific and testable, and the numerical part supplies some grounding even if the axiom remains open to debate. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a new axiom for generalized entropy families: no entropic parameter can be inferred from a completely uninformative (uniform) probability distribution. When imposed on the Uffink-Jizba-Korbel and Hanel-Thurner families, the axiom selects only Rényi entropy. It further claims that this axiom extends consistency with the maximum-likelihood principle, permits data-driven estimation of the entropic parameter, and implies that the maximized log-likelihood in a generalized maximum-entropy setting always equals minus the Shannon entropy, regardless of which entropy is maximized.

Significance. If the derivation holds, the work supplies a data-driven selection criterion among generalized entropies and removes a known inconsistency between generalized maxent and likelihood-based inference. The numerical verification of parameter recovery and the explicit likelihood-Shannon equality constitute concrete, falsifiable contributions.

major comments (2)

[§3.1] Definition of the inference map: the axiom is load-bearing for both the selection of Rényi entropy and the subsequent log-likelihood identity, yet the precise procedure by which an entropic parameter is 'inferred' from a uniform distribution is stated only informally. Different choices of estimator or likelihood could permit parameter recovery from uniform data, altering which members of the families survive the axiom.
[§4.2, Eq. (22)] The claim that maximized log-likelihood equals -S_Shannon is derived under the new axiom, but the algebraic steps showing independence from the specific generalized entropy (beyond Rényi) are not fully expanded; an explicit intermediate identity linking the maxent Lagrange multiplier to the Shannon functional would strengthen the result.

minor comments (2)

Notation for the entropic parameter q (or α) is introduced inconsistently across the UJK and HT families; a single table comparing the parameter ranges and the action of the axiom would improve readability.
Figure 2 caption should state the sample size and number of Monte-Carlo realizations used for the numerical confirmation of parameter recovery.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will revise the manuscript to improve clarity where needed.

Referee: [§3.1] Definition of the inference map: the axiom is load-bearing for both the selection of Rényi entropy and the subsequent log-likelihood identity, yet the precise procedure by which an entropic parameter is 'inferred' from a uniform distribution is stated only informally. Different choices of estimator or likelihood could permit parameter recovery from uniform data, altering which members of the families survive the axiom.

Authors: We agree that a more formal definition of the inference map strengthens the presentation. In the revised manuscript we will explicitly define the inference procedure as maximum-likelihood estimation of the entropic parameter from the observed distribution, and we will prove that, under this standard estimator, the uniform distribution yields a flat likelihood independent of the parameter only for Rényi entropy among the members of the families. We maintain that the axiom is intended to apply to any estimator capable of distinguishing parameters; the explicit MLE choice removes ambiguity without changing the selection result. Revision: yes.

Referee: [§4.2, Eq. (22)] The claim that maximized log-likelihood equals -S_Shannon is derived under the new axiom, but the algebraic steps showing independence from the specific generalized entropy (beyond Rényi) are not fully expanded; an explicit intermediate identity linking the maxent Lagrange multiplier to the Shannon functional would strengthen the result.

Authors: The referee is correct that the derivation of Eq. (22) can be made more transparent. We will expand the algebra in the revision by inserting the intermediate identity that equates the Lagrange multiplier of the generalized maxent problem (under the new axiom) directly to the negative Shannon entropy of the maximizing distribution. This step shows explicitly why the maximized log-likelihood equals -S_Shannon independently of which generalized entropy is used, provided the axiom holds. Revision: yes.

Circularity Check

0 steps flagged

New axiom on uniform distributions selects Rényi entropy and implies likelihood equality as direct consequence, with no reduction to fitted inputs or self-referential definitions

full rationale

The paper introduces its central axiom (no entropic parameter inferable from uniform distributions) as an independent premise, then applies it to the externally cited Uffink-Jizba-Korbel and Hanel-Thurner families to select Rényi entropy. The further claim that maximized log-likelihood equals minus Shannon entropy is presented as a logical implication of this axiom within the generalized maxent setting, not as a redefinition or statistical fit of the same quantity. No equations reduce a derived result to a parameter estimated from the target data, and no load-bearing step relies on self-citation chains. The derivation remains self-contained against the stated axiom and external families.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the newly introduced axiom together with the algebraic properties of the Uffink-Jizba-Korbel and Hanel-Thurner families; no free parameters or new entities are introduced.

axioms (1)

ad hoc to paper: No entropic parameter can be inferred from a completely uninformative (uniform) probability distribution.
This is the axiom introduced in the abstract and used to select among entropy families.

pith-pipeline@v0.9.0 · 5748 in / 1354 out tokens · 19557 ms · 2026-05-24T10:20:16.542161+00:00 · methodology

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

[1]

coincides with the physical entropy derived by Gibbs [3], which in turn generalizes Boltzmann entropy in Eq. (

[2]

[2]. This equivalence is not coincidental and is rooted in statistical inference, as Jaynes showed with the introduction of the Maximum Entropy Principle (MEP) [5]. The MEP states that, given only a set I of pieces of empirical information about a system (in the physical situation, this typically means the knowledge of a few, macroscopic conserved quantit...

[3]

as any other parametric distribution and select the value $\theta_1^* = \arg\max_\theta \ell_1(\theta)$, $\ell_1(\theta) \equiv \ln p_1(G^*, \theta)$. (10) As a first result, it is easy to show that the value $\theta_1^*$ defined by Eq. (10) coincides with the value $\theta_1$ defined by Eq. (9) [14, 15], i.e. $\theta_1^* \equiv \theta_1$ (in our notation, the asterisk next to a parameter will always denote the ML value of that paramet...

[4]

in the case of soft constraints. This relationship is very important, because the maximized likelihood is at the basis of model selection criteria [16, 17]: if alternative models (i.e. alternative parametric probability distributions) are compared against the same empirical data, the model to be preferred (assuming all models have the same complexit...

[5]

ensures that the ranking of models based on ML is the same as the ranking based on minus their entropy: the least uncertain (i.e. most informative) model has to be preferred. For models with different numbers of parameters and/or functional forms, the ranking based on likelihood/entropy has to be revised by adding a term controlling for the variabl...

[6]

is inserted into Eq. (5), we get $L_1[P_1(\theta_1^*)] = S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)$, (13) so that the Lagrangian, evaluated at $P_1(\theta_1^*)$, coincides with minus the maximized log-likelihood, and can therefore be used to rank alternative models as well. All the above results indicate that the MEP can be used as a model selection criterion, exactly as the ML principle, ...

[7]

with $\theta_1^* = \arg\max_\theta \ell_1(\theta)$, $\ell_1(\theta) \equiv \frac{1}{M}\sum_{m=1}^{M} \ln p_1(G_m^*, \theta)$. (14) It is easy to show that the condition $\partial \ell_1(\theta)/\partial\theta\,|_{\theta_1^*} = 0$ identifying $\theta_1^*$ leads to the well-known result $\langle C \rangle = \frac{1}{M}\sum_{m=1}^{M} C_m^*$ (15) where the (arithmetic) sample average of the M observations has emerged. So, in order to find the ML parameter value $\theta_1^*$, one should replace Eq. (

[8]

with Eq. (15), or equivalently redefine $C^*$ in Eq. (4) as the sample average of $\{C_m^*\}_{m=1}^{M}$. In plain words, the sample average is ‘produced’ by the ML principle. On the contrary, within the MEP construction, there is no way of ‘telling’ Eqs. (5) and (9) what, in case of M observations, the meaning and definition of $C^*$ should be. So in this case the ML ...

[9]

is still equal to minus the entropy: $S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)$, (16) generalizing Eq. (12). Note that there is no microcanonical counterpart of Eq. (16), since Eq. (3) cannot be generalized to the case $M > 1$, unless all the M values $\{C_m^*\}_{m=1}^{M}$ are identical. Indeed the microcanonical ensemble cannot be constructed, because by definition it cannot acco...

[10]

(note instead that the subscript in the uniform distribution $P_0$ used in Sec. II B to describe the microcanonical distribution under hard constraints has nothing to do with the case $q = 0$, which is inadmissible). Notably, Jizba and Korbel [20, 21] showed that an entropy of the type $f(U_q[P])$ can also be obtained from the SK axioms, provided that SK4 is rel...

[11]

and (21), implying that, in absence of such knowledge, the entropic parameters cannot be consistently derived purely from data as the other parameters. Another approach does allow for the entropic parameters to be inferred from data, again invoking some form of maximization of the generalized entropy [26, 27]. However, as we show below, this requiremen...

[12]

is maximized by $P_u$. This requirement comes from a ‘horizontal’ perspective, in the sense that it holds for each q-entropy in the family. Our axiom, on the other hand, provides a ‘vertical’ perspective: among all the q-entropies, none of them has to be preferred when applied to $P_u$. In other words, the axiom ensures the uninformativeness role of the uni...

[13]

and (24) we obtain $f(x) = \ln x$ for all q, i.e. the only viable UJK entropy is Rényi entropy [28] $S_q[P] \equiv S_q^{(\ln)}[P] = \ln U_q[P] = \frac{1}{1-q} \ln \sum_{i=1}^{\Omega} p^q(G_i)$, (25) where, since the entropy above is the only ‘surviving one’ in the family $S_q^{(f)}[P]$, we have removed the superscript from the resulting $S_q^{(\ln)}[P]$. From Eq. (

[14]

we can confirm that this entropy reduces to Shannon entropy in the limit $q \to 1$, a well-known result for Rényi entropy. This entropy is such that, on the uniform distribution $P_u$, $S_q[P_u] = \ln \Omega$, (26) which does not depend on q, as demanded by our axiom. Therefore the only viable UJK entropy is Rényi entropy. In general, other UJK entropies do not resp...

[15]

or (3). Note that Eq. (3) applies in the microcanonical case, but a similar non-extensive scaling of the entropy would be exhibited in the canonical case as well. An important example in this respect is provided by random graphs: the number of all binary graphs on n vertices is $\Omega = 2^{\binom{n}{2}}$, so it is super-exponential [8, 30]. Even when subject to vario...

[16]

[8], consistently with the fact that the only admissible trace-form HT entropy according to our axiom is Shannon entropy, as we have shown above. To obtain a truly generalized maximum-entropy probability, one should therefore consider non-trace-form entropies. In particular, considering the Rényi entropy $S_q[P]$ which our axiom selects from both the UJK ...

[17]

(note that $\langle C \rangle_1 = \langle C \rangle$). The quantity $\langle C \rangle_q$ is sometimes called (normalized) q-mean, and it can be regarded as a mean with respect to the so-called escort (or zooming) probability distribution $\tilde{p}(G_i) = p^q(G_i) / \sum_j p^q(G_j)$ [7, 31]. This q-mean has been introduced to extend important properties and relations from the classical (i.e. Shannonian) statistical mech...

[18]

and then summing over i. We then get $\alpha_q = \frac{q}{1-q}$ ($q \neq 1$), (35) which is the counterpart of Eq. (8). Substituting $\alpha_q$ in (33) and singling out $p_q(G_i)$ yields $p_q(G_i, \theta) = \frac{[1 - (1-q)\,\theta \cdot (C(G_i) - \langle C \rangle_q)]_+^{1/(1-q)}}{[\sum_{j=1}^{\Omega} p_q^q(G_j, \theta)]^{1/(1-q)}}$ (36) where we have used the notation $[x]_+^a \equiv 0$ if $x < 0$, while $[x]_+^a \equiv x^a$ otherwise [7]. Note that the denominato...

[19]

equals the Uffink functional $U_q[P_q(\theta)]$ and must also equal the generalized partition function $W_q(\theta) \equiv \sum_{i=1}^{\Omega} [1 - (1-q)\,\theta \cdot (C(G_i) - \langle C \rangle_q)]_+^{1/(1-q)}$ (37) since $p_q(G_i, \theta)$ is already normalized via the condition in Eq. (35). In other words, $W_q(\theta) = [\sum_{i=1}^{\Omega} p_q^q(G_i, \theta)]^{1/(1-q)} = U_q[P_q(\theta)]$. (38) Finally, the maximum-entropy probability equals $p_q(G_i, \theta) = [1$...

[20]

to the case $q \neq 1$, generalizing the result that, in a model selection framework, different models can be ranked according to their maximized likelihood or, equivalently, to their realized Rényi entropy. Notably, other entropies of the UJK family, including Tsallis entropy, do not manifest this property. Also the relationship in Eq. (

[21]

generalizes as follows: $L_q[P_q(\psi_q^*)] = S_q[P_q(\psi_q^*)] = -\ell_q(\psi_q^*)$, (56) relating the value of the Lagrangian attained by $P_q(\psi_q^*)$ to the maximized log-likelihood. Therefore, up to this point, it seems that Rényi entropy retains all the desirable properties of Shannon entropies. We now consider the case of $M > 1$ i.i.d. realizations $\{G_m^*\}_{m=1}^{M}$ of...

[22]

converges if q is such that the distribution is normalizable (which is a basic requirement for this procedure to be consistent [7]). The ML estimator determined by Eq. (

[23]

identifies the distribution’s parameters, irrespective of the convergence of any moment. An important consequence of the fact that $\langle C \rangle_q$ is no longer equal to the arithmetic mean of the M observations is that in general, for $q \neq 1$ and $M > 1$, $S_q[P_q(\psi_q^*)] \neq -\ell_q(\psi_q^*)$, (60) thus failing to generalize Eq. (16) to the case $q \neq 1$ and Eq. (55) to the case M...

[24]

(with q replaced by $q^*$) plus the additional condition $\sum_{i=1}^{\Omega} p_{q^*}(G_i, \psi^*_{q^*}) \ln[1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_i)] = \frac{1}{M}\sum_{m=1}^{M} \ln[1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_m^*)]$. (65) Recalling from Eq. (

[25]

that $1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_i) = [p_{q^*}(G_i, \psi^*_{q^*})\, Z_{q^*}(\psi^*_{q^*})]^{1-q^*}$ (66) we obtain the condition $\sum_{i=1}^{\Omega} p_{q^*}(G_i, \psi^*_{q^*}) \ln p_{q^*}(G_i, \psi^*_{q^*}) = \frac{1}{M}\sum_{m=1}^{M} \ln p(G_m^*, \psi^*_{q^*})$. (67) In other words, the additional ML condition determining $q^*$ requires that the maximized log-likelihood equals minus Shannon entropy, i.e. $S_1[P_{q^*}(\psi^*_{q^*})] = -\ell_{q^*}(\psi^*_{q^*})$, (68) r...

[26]

is not replaced, but rather accompanied by Eq. (61). Remarkably, the connection between Shannon entropy and log-likelihood at the specific parameter value $(\psi^*_{q^*}, q^*)$ remains a general result, even for $q \neq 1$ and $M > 1$. This might look quite surprising, because, for $q \neq 1$, the log-likelihood is based on the q-exponential distribution that maximizes ...

[27]

should be put in relation with our initial discussion of the axiom SJ3 about system independence. Recall that assuming that the M values $\{C_m^*\}_{m=1}^{M}$ come from independent observations is equivalent to assuming that there are M identical and independent copies of the same system, each copy being observed exactly once. Under this assumption of indepen...

[28]

and combine it with Eq. (68) to obtain $S_{q^*}[P_{q^*}(\psi^*_{q^*})] = S_1[P_{q^*}(\psi^*_{q^*})]$ (69) showing that in this particular case the maximum-entropy probability distribution returns coinciding values of Shannon and Rényi entropy, even if it maximizes the latter but not the former. This result does not hold in general for $M > 1$. The remarkable result in Eq. (

[29]

has an important consequence for model selection. In particular, in order to determine both $q^*$ and $\psi^*_{q^*}$, one can consider a range of values for q and, for each value in the range, compute $\psi_q^*$ according to Eq. (59). This results, for each value of q, in a log-likelihood $\ell_q(\psi_q^*)$ that is only partially maximized, in the sense that the maximization ha...

[30]

is realized. The true values of the parameters and their inferred ML estimates $(q^*, \psi^*_{q^*})$ are presented in Table I. Since the left plot corresponds to $q_{\text{true}} = 1$, it is a standard exponential distribution. In such a case, the two curves intersect only for $q = 1$. By contrast, the other two cases correspond to $q_{\text{true}} \neq 1$ and the two curves intersect in tw...

[31]

R. Clausius, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 12, 241 (1856)

[32]

L. Boltzmann, Über die Beziehung zwischen dem zweiten Hauptsatze des mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung, respective den Sätzen über das Wärmegleichgewicht (Kk Hof-und Staatsdruckerei, 1877)

[33]

J. W. Gibbs, Elementary principles in statistical mechanics (Courier Corporation, 2014)

[34]

C. E. Shannon, Bell system technical journal 27, 379 (1948)

[35]

E. T. Jaynes, Physical review 106, 620 (1957)

[36]

K. P. Murphy, Probabilistic machine learning: an introduction (MIT press, 2022)

[37]

C. Tsallis, Introduction to nonextensive statistical mechanics: approaching a complex world (Springer Science & Business Media, 2009)

[38]

S. Thurner, R. Hanel, and P. Klimek, Introduction to the theory of complex systems (Oxford University Press, 2018)

[39]

J. M. Amigó, S. G. Balogh, and S. Hernández, Entropy 20, 813 (2018)

[40]

A. M. Lopes and J. A. T. Machado, Entropy 22, 1374 (2020)

[41]

A. Khinchin, New York (1957)

[42]

J. Karmeshu, Entropy measures, maximum entropy principle and emerging applications, Vol. 119 (Springer Science & Business Media, 2003)

[43]

T. Squartini and D. Garlaschelli, Maximum-Entropy Networks: Pattern Detection, Network Reconstruction and Graph Combinatorics (Springer, 2017)

[45]

D. Garlaschelli and M. I. Loffredo, Physical Review E 78, 015101 (2008)

[46]

D. R. A. Kenneth P. Burnham, Model Selection and Multimodel Inference (Springer New York, NY, 2002)

[47]

P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in minimum description length: Theory and applications (MIT press, 2005)

[48]

J. Shore and R. Johnson, IEEE Transactions on information theory 26, 26 (1980)

[49]

J. Uffink, Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics 26, 223 (1995)

[50]

P. Jizba and J. Korbel, Physical review letters 122, 120601 (2019)

[51]

P. Jizba and J. Korbel, Physical Review E 101, 042126 (2020)

[52]

A. N. Kolmogorov and G. Castelnuovo, Sur la notion de la moyenne (G. Bardi, tip. della R. Accad. dei Lincei, 1930)

[53]

M. Nagumo, in Japanese journal of mathematics: transactions and abstracts, Vol. 7 (The Mathematical Society of Japan, 1930) pp. 71–79

[54]

R. Hanel and S. Thurner, EPL (Europhysics Letters) 93, 20006 (2011)

[55]

S. G. Balogh, G. Palla, P. Pollner, and D. Czégel, Scientific reports 10, 1 (2020)

[56]

A. Plastino, H. Miller, and A. Plastino, Continuum Mechanics and Thermodynamics 16, 269 (2004)

[57]

A. Bashkirov, Physical review letters 93, 130601 (2004)

[58]

A. Rényi, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (University of California Press, 1961) pp. 547–561

[59]

C. Tsallis, Journal of statistical physics 52, 479 (1988)

[60]

Q. Zhang and D. Garlaschelli, New Journal of Physics 24, 043011 (2022)

[61]

C. Beck and F. Schögl, Thermodynamics of chaotic systems: an introduction, 4 (Cambridge University Press, 1995)

[1] [1]

coincides with the physical entropy derived by Gibbs [3], which in turn generalizes Boltzmann entropy in Eq. (

work page

[2] [2]

This equivalence is not coincidental and is rooted in statistical inference, as Jaynes showed with the introduction of the Maximum Entropy Principle (MEP) [5]

[2]. This equivalence is not coincidental and is rooted in statistical inference, as Jaynes showed with the introduction of the Maximum Entropy Principle (MEP) [5]. The MEP states that, given only a set I of pieces of empirical information about a system (in the physical situation, this typically means the knowledge of a few, macroscopic conserved quantit...

work page

[3] [3]

(10) As a ﬁrst result, it is easy to show that the value θ∗ 1 deﬁned by Eq

as any other para- metric distribution and select the value θ∗ 1 = argmax θ ℓ1(θ), ℓ 1(θ) ≡ lnp1(G∗,θ ). (10) As a ﬁrst result, it is easy to show that the value θ∗ 1 deﬁned by Eq. ( 10) coincides with the value θ1 deﬁned by Eq. ( 9) [14, 15], i.e. θ∗ 1 ≡ θ1 (in our notation, the asterisk next to a parameter will always denote the ML value of that paramet...

work page

[4] [4]

in the case of soft constraints. This relationship is very important, because the maximized likelihood is at the basis of model selection criteria [16, 17]: if alternative models (i.e. alternative parametric probability distributions) are compared against the same empirical data, the model to be preferred (assuming all models have the same complexit...

[5] [5]

ensures that the ranking of models based on ML is the same as the ranking based on minus their entropy: the least uncertain (i.e. most informative) model has to be preferred. For models with different numbers of parameters and/or functional forms, the ranking based on likelihood/entropy has to be revised by adding a term controlling for the variabl...

[6] [6]

is inserted into Eq. (5), we get $L_1[P_1(\theta_1^*)] = S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)$, (13) so that the Lagrangian, evaluated at $P_1(\theta_1^*)$, coincides with minus the maximized log-likelihood, and can therefore be used to rank alternative models as well. All the above results indicate that the MEP can be used as a model selection criterion, exactly as the ML principle, ...

[7] [7]

with $\theta_1^* = \arg\max_\theta \ell_1(\theta)$, $\ell_1(\theta) \equiv \frac{\sum_{m=1}^{M} \ln p_1(G_m^*, \theta)}{M}$. (14) It is easy to show that the condition $\partial \ell_1(\theta)/\partial\theta \,|_{\theta_1^*} = 0$ identifying $\theta_1^*$ leads to the well-known result $\langle C \rangle = \frac{1}{M} \sum_{m=1}^{M} C_m^*$ (15) where the (arithmetic) sample average of the $M$ observations has emerged. So, in order to find the ML parameter value $\theta_1^*$, one should replace Eq. (

[8] [8]

with Eq. (15), or equivalently redefine $C^*$ in Eq. (4) as the sample average of $\{C_m^*\}_{m=1}^{M}$. In plain words, the sample average is ‘produced’ by the ML principle. On the contrary, within the MEP construction, there is no way of ‘telling’ Eqs. (5) and (9) what, in case of $M$ observations, the meaning and definition of $C^*$ should be. So in this case the ML ...

[9] [9]

is still equal to minus the entropy: $S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)$, (16) generalizing Eq. (12). Note that there is no microcanonical counterpart of Eq. (16), since Eq. (3) cannot be generalized to the case $M > 1$, unless all the $M$ values $\{C_m^*\}_{m=1}^{M}$ are identical. Indeed the microcanonical ensemble cannot be constructed, because by definition it cannot acco...

[10] [10]

(note instead that the subscript in the uniform distribution $P_0$ used in Sec. II B to describe the microcanonical distribution under hard constraints has nothing to do with the case $q = 0$, which is inadmissible). Notably, Jizba and Korbel [20, 21] showed that an entropy of the type $f(U_q[P])$ can also be obtained from the SK axioms, provided that SK4 is rel...

[11] [11]

and (21), implying that, in absence of such knowledge, the entropic parameters cannot be consistently derived purely from data as the other parameters. Another approach does allow for the entropic parameters to be inferred from data, again invoking some form of maximization of the generalized entropy [26, 27]. However, as we show below, this requiremen...

[12] [12]

is maximized by $P_u$. This requirement comes from a ‘horizontal’ perspective, in the sense that it holds for each q-entropy in the family. Our axiom, on the other hand, provides a ‘vertical’ perspective: among all the q-entropies, none of them has to be preferred when applied to $P_u$. In other words, the axiom ensures the uninformativeness role of the uni...

[13] [13]

and (24) we obtain $f(x) = \ln x$ for all $q$, i.e. the only viable UJK entropy is Rényi entropy [28] $S_q[P] \equiv S_q^{(\ln)}[P] = \ln U_q[P] = \frac{1}{1-q} \ln \sum_{i=1}^{\Omega} p^q(G_i)$, (25) where, since the entropy above is the only ‘surviving one’ in the family $S_q^{(f)}[P]$, we have removed the superscript from the resulting $S_q^{(\ln)}[P]$. From Eq. (

[14] [14]

we can confirm that this entropy reduces to Shannon entropy in the limit $q \to 1$, a well-known result for Rényi entropy. This entropy is such that, on the uniform distribution $P_u$, $S_q[P_u] = \ln \Omega$, (26) which does not depend on $q$, as demanded by our axiom. Therefore the only viable UJK entropy is Rényi entropy. In general, other UJK entropies do not resp...
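The statement in Eq. (26) can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names and the choice $\Omega = 8$ are illustrative assumptions, and Tsallis entropy is included only as a contrasting member of the UJK family, written in its standard form $(\sum_i p_i^q - 1)/(1-q)$.

```python
import math

def renyi_entropy(p, q):
    """Renyi entropy ln(sum_i p_i^q) / (1 - q); Shannon entropy at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return math.log(sum(x ** q for x in p if x > 0)) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy (sum_i p_i^q - 1) / (1 - q), shown only for contrast."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** q for x in p if x > 0) - 1.0) / (1.0 - q)

# Hypothetical example: uniform distribution over omega = 8 states.
omega = 8
uniform = [1.0 / omega] * omega
for q in (0.5, 1.0, 2.0, 3.0):
    # The Renyi column stays at ln(omega); the Tsallis column varies with q.
    print(q, renyi_entropy(uniform, q), tsallis_entropy(uniform, q))
```

On the uniform distribution the Rényi value is the same for every $q$, whereas the Tsallis value changes with $q$, which is the behaviour the axiom rules out.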

[15] [15]

or (3). Note that Eq. (3) applies in the microcanonical case, but a similar non-extensive scaling of the entropy would be exhibited in the canonical case as well. An important example in this respect is provided by random graphs: the number of all binary graphs on $n$ vertices is $\Omega = 2^{\binom{n}{2}}$, so it is super-exponential [8, 30]. Even when subject to vario...

[16] [16]

[8], consistently with the fact that the only admissible trace-form HT entropy according to our axiom is Shannon entropy, as we have shown above. To obtain a truly generalized maximum-entropy probability, one should therefore consider non-trace-form entropies. In particular, considering the Rényi entropy $S_q[P]$ which our axiom selects from both the UJK ...

[17] [17]

(note that $\langle C \rangle_1 = \langle C \rangle$). The quantity $\langle C \rangle_q$ is sometimes called (normalized) q-mean, and it can be regarded as a mean with respect to the so-called escort (or zooming) probability distribution $\tilde{p}(G_i) = p^q(G_i) / \sum_j p^q(G_j)$ [7, 31]. This q-mean has been introduced to extend important properties and relations from the classical (i.e. Shannonian) statistical mech...
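As an illustration of the escort distribution and the q-mean described above, here is a minimal sketch. The helper names `escort` and `q_mean` and the three-state example are assumptions made for illustration, not notation from the paper.

```python
def escort(p, q):
    """Escort distribution: p_i^q / sum_j p_j^q."""
    weights = [x ** q for x in p]
    total = sum(weights)
    return [w / total for w in weights]

def q_mean(values, p, q):
    """Normalized q-mean: average of `values` under the escort distribution."""
    return sum(v * e for v, e in zip(values, escort(p, q)))

# Hypothetical three-state example.
p = [0.5, 0.3, 0.2]
values = [0.0, 1.0, 2.0]
print(q_mean(values, p, 1.0))  # q = 1 recovers the ordinary mean
print(q_mean(values, p, 2.0))  # q > 1 emphasizes the more probable states
```

For $q = 1$ the escort distribution coincides with the original one, so the q-mean reduces to the ordinary expectation, consistent with $\langle C \rangle_1 = \langle C \rangle$.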

[18] [18]

and then summing over $i$. We then get $\alpha_q = \frac{q}{1-q}$ ($q \neq 1$), (35) which is the counterpart of Eq. (8). Substituting $\alpha_q$ in (33) and singling out $p_q(G_i)$ yields $p_q(G_i, \theta) = \frac{[1 - (1-q)\,\theta \cdot (C(G_i) - \langle C \rangle_q)]_+^{1/(1-q)}}{\left[\sum_{j=1}^{\Omega} p_q^q(G_j, \theta)\right]^{1/(1-q)}}$ (36) where we have used the notation $[x]_+^a \equiv 0$ if $x < 0$, while $[x]_+^a \equiv x^a$ otherwise [7]. Note that the denominato...

[19] [19]

equals the Uffink functional $U_q[P_q(\theta)]$ and must also equal the generalized partition function $W_q(\theta) \equiv \sum_{i=1}^{\Omega} [1 - (1-q)\,\theta \cdot (C(G_i) - \langle C \rangle_q)]_+^{1/(1-q)}$ (37) since $p_q(G_i, \theta)$ is already normalized via the condition in Eq. (35). In other words, $W_q(\theta) = \left[\sum_{i=1}^{\Omega} p_q^q(G_i, \theta)\right]^{1/(1-q)} = U_q[P_q(\theta)]$. (38) Finally, the maximum-entropy probability equals $p_q(G_i, \theta) =$ [1...

[20] [20]

to the case $q \neq 1$, generalizing the result that, in a model selection framework, different models can be ranked according to their maximized likelihood or, equivalently, to their realized Rényi entropy. Notably, other entropies of the UJK family, including Tsallis entropy, do not manifest this property. Also the relationship in Eq. (

[21] [21]

generalizes as follows: $L_q[P_q(\psi_q^*)] = S_q[P_q(\psi_q^*)] = -\ell_q(\psi_q^*)$, (56) relating the value of the Lagrangian attained by $P_q(\psi_q^*)$ to the maximized log-likelihood. Therefore, up to this point, it seems that Rényi entropy retains all the desirable properties of Shannon entropy. We now consider the case of $M > 1$ i.i.d. realizations $\{G_m^*\}_{m=1}^{M}$ of...

[22] [22]

converges if q is such that the distribution is normalizable (which is a basic requirement for this procedure to be consistent [7]). The ML estimator determined by Eq. (

[23] [23]

identifies the distribution’s parameters, irrespective of the convergence of any moment. An important consequence of the fact that $\langle C \rangle_q$ is no longer equal to the arithmetic mean of the $M$ observations is that in general, for $q \neq 1$ and $M > 1$, $S_q[P_q(\psi_q^*)] \neq -\ell_q(\psi_q^*)$, (60) thus failing to generalize Eq. (16) to the case $q \neq 1$ and Eq. (55) to the case M...

[24] [24]

(with $q$ replaced by $q^*$) plus the additional condition $\sum_{i=1}^{\Omega} p_{q^*}(G_i, \psi^*_{q^*}) \ln\left[1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_i)\right] = \frac{1}{M} \sum_{m=1}^{M} p_{q^*}(G_m^*, \psi^*_{q^*}) \ln\left[1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_m)\right]$. (65) Recalling from Eq. (

[25] [25]

that $1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_i) = [p_{q^*}(G_i, \psi^*_{q^*})\, Z_{q^*}(\psi^*_{q^*})]^{1-q^*}$ (66) we obtain the condition $\sum_{i=1}^{\Omega} p_{q^*}(G_i, \psi^*_{q^*}) \ln p_{q^*}(G_i, \psi^*_{q^*}) = \frac{1}{M} \sum_{m=1}^{M} \ln p(G_m^*, \psi^*_{q^*})$. (67) In other words, the additional ML condition determining $q^*$ requires that the maximized log-likelihood equals minus Shannon entropy, i.e. $S_1[P_{q^*}(\psi^*_{q^*})] = -\ell_{q^*}(\psi^*_{q^*})$, (68) r...

[26] [26]

is not replaced, but rather accompanied by Eq. (61). Remarkably, the connection between Shannon entropy and log-likelihood at the specific parameter value $(\psi^*_{q^*}, q^*)$ remains a general result, even for $q \neq 1$ and $M > 1$. This might look quite surprising, because, for $q \neq 1$, the log-likelihood is based on the q-exponential distribution that maximizes ...

[27] [27]

should be put in relation with our initial discussion of the axiom SJ3 about system independence. Recall that assuming that the $M$ values $\{C_m^*\}_{m=1}^{M}$ come from independent observations is equivalent to assuming that there are $M$ identical and independent copies of the same system, each copy being observed exactly once. Under this assumption of indepen...

[28] [28]

and combine it with Eq. (68) to obtain $S_{q^*}[P_{q^*}(\psi^*_{q^*})] = S_1[P_{q^*}(\psi^*_{q^*})]$ (69) showing that in this particular case the maximum-entropy probability distribution returns coinciding values of Shannon and Rényi entropy, even if it maximizes the latter but not the former. This result does not hold in general for $M > 1$. The remarkable result in Eq. (

[29] [29]

has an important consequence for model selection. In particular, in order to determine both $q^*$ and $\psi^*_{q^*}$, one can consider a range of values for $q$ and, for each value in the range, compute $\psi_q^*$ according to Eq. (59). This results, for each value of $q$, in a log-likelihood $\ell_q(\psi_q^*)$ that is only partially maximized, in the sense that the maximization ha...
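The scan over $q$ described above can be sketched as follows. This is an illustrative toy under explicit assumptions, not the authors' implementation: the state space is a hypothetical finite set with a scalar constraint $C(G_i) = i$, the weights follow the form implied by Eq. (66), the grid is restricted to $q \geq 1$ so that the cutoff $[\cdot]_+$ never triggers, and the inner maximization over $\psi$ uses a hand-written golden-section search that assumes the log-likelihood is unimodal in $\psi$ on the chosen interval.

```python
import math
import random

def q_exp_pmf(c, psi, q):
    """Normalized weights [1 - (1 - q) psi c_i]^(1/(1-q)); exp(-psi c_i) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        w = [math.exp(-psi * x) for x in c]
    else:
        w = [(1.0 - (1.0 - q) * psi * x) ** (1.0 / (1.0 - q)) for x in c]
    z = sum(w)
    return [x / z for x in w]

def avg_loglik(counts, c, psi, q):
    """Average log-likelihood of the observed state counts."""
    p = q_exp_pmf(c, psi, q)
    m = sum(counts)
    return sum(n * math.log(pi) for n, pi in zip(counts, p) if n) / m

def argmax_golden(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of f, assumed unimodal on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        x1, x2 = b - g * (b - a), a + g * (b - a)
        if f(x1) < f(x2):
            a = x1
        else:
            b = x2
    return 0.5 * (a + b)

def profile(counts, c, q_grid, psi_max=10.0):
    """For each q: fit psi by ML, then report the log-likelihood and Shannon entropy."""
    rows = []
    for q in q_grid:
        psi = argmax_golden(lambda s: avg_loglik(counts, c, s, q), 1e-9, psi_max)
        p = q_exp_pmf(c, psi, q)
        shannon = -sum(x * math.log(x) for x in p if x > 0)
        rows.append((q, psi, avg_loglik(counts, c, psi, q), shannon))
    return rows

# Hypothetical synthetic data: 5000 draws from the q = 1 model with psi = 0.3.
random.seed(0)
c = list(range(30))
p_true = q_exp_pmf(c, 0.3, 1.0)
draws = random.choices(range(len(c)), weights=p_true, k=5000)
counts = [draws.count(i) for i in range(len(c))]
for q, psi, ll, sh in profile(counts, c, [1.0, 1.1, 1.2, 1.3]):
    print(f"q={q:.1f}  psi*={psi:.4f}  loglik={ll:.5f}  -Shannon={-sh:.5f}")
```

Comparing the two printed columns across the grid gives a numerical handle on the condition in Eq. (68): candidate values of $q^*$ are those where the partially maximized log-likelihood and minus the Shannon entropy of the fitted distribution coincide.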

[30] [30]

is realized. The true values of the parameters and their inferred ML estimates $(q^*, \psi^*_{q^*})$ are presented in Table I. Since the left plot corresponds to $q_{\mathrm{true}} = 1$, it is a standard exponential distribution. In such a case, the two curves intersect only for $q = 1$. By contrast, the other two cases correspond to $q_{\mathrm{true}} \neq 1$ and the two curves intersect in tw...

[31] [31]

R. Clausius, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 12, 241 (1856)

[32] [32]

L. Boltzmann, Über die Beziehung zwischen dem zweiten Hauptsatze des mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung, respective den Sätzen über das Wärmegleichgewicht (Kk Hof-und Staatsdruckerei, 1877)

[33] [33]

J. W. Gibbs, Elementary principles in statistical mechanics (Courier Corporation, 2014)

[34] [34]

C. E. Shannon, Bell system technical journal 27, 379 (1948)

[35] [35]

E. T. Jaynes, Physical review 106, 620 (1957)

[36] [36]

K. P. Murphy, Probabilistic machine learning: an introduction (MIT press, 2022)

[37] [37]

C. Tsallis, Introduction to nonextensive statistical mechanics: approaching a complex world (Springer Science & Business Media, 2009)

[38] [38]

S. Thurner, R. Hanel, and P. Klimek, Introduction to the theory of complex systems (Oxford University Press, 2018)

[39] [39]

J. M. Amigó, S. G. Balogh, and S. Hernández, Entropy 20, 813 (2018)

[40] [40]

A. M. Lopes and J. A. T. Machado, Entropy 22, 1374 (2020)

[41] [41]

A. Khinchin, New York (1957)

[42] [42]

J. Karmeshu, Entropy measures, maximum entropy principle and emerging applications, Vol. 119 (Springer Science & Business Media, 2003)

[43] [43]

T. Squartini and D. Garlaschelli, Maximum-Entropy Networks: Pattern Detection, Network Reconstruction and Graph Combinatorics (Springer, 2017)

[44] [45]

D. Garlaschelli and M. I. Loffredo, Physical Review E 78, 015101 (2008)

[45] [46]

K. P. Burnham and D. R. Anderson, Model Selection and Multimodel Inference (Springer New York, NY, 2002)

[46] [47]

P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in minimum description length: Theory and applications (MIT press, 2005)

[47] [48]

J. Shore and R. Johnson, IEEE Transactions on information theory 26, 26 (1980)

[48] [49]

J. Uffink, Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics 26, 223 (1995)

[49] [50]

P. Jizba and J. Korbel, Physical review letters 122, 120601 (2019)

[50] [51]

P. Jizba and J. Korbel, Physical Review E 101, 042126 (2020)

[51] [52]

A. N. Kolmogorov and G. Castelnuovo, Sur la notion de la moyenne (G. Bardi, tip. della R. Accad. dei Lincei, 1930)

[52] [53]

M. Nagumo, in Japanese journal of mathematics: transactions and abstracts, Vol. 7 (The Mathematical Society of Japan, 1930) pp. 71–79

[53] [54]

R. Hanel and S. Thurner, EPL (Europhysics Letters) 93, 20006 (2011)

[54] [55]

S. G. Balogh, G. Palla, P. Pollner, and D. Czégel, Scientific reports 10, 1 (2020)

[55] [56]

A. Plastino, H. Miller, and A. Plastino, Continuum Mechanics and Thermodynamics 16, 269 (2004)

[56] [57]

A. Bashkirov, Physical review letters 93, 130601 (2004)

[57] [58]

A. Rényi, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (University of California Press, 1961) pp. 547–561

[58] [59]

C. Tsallis, Journal of statistical physics 52, 479 (1988)

[59] [60]

Q. Zhang and D. Garlaschelli, New Journal of Physics 24, 043011 (2022)

[60] [61]

C. Beck and F. Schlögl, Thermodynamics of chaotic systems: an introduction, 4 (Cambridge University Press, 1995)