Learn your entropy from informative data: an axiom ensuring the consistent identification of generalized entropies
Pith reviewed 2026-05-24 10:20 UTC · model grok-4.3
The pith
A new axiom treating uniform distributions as uninformative selects only Rényi entropy and ensures that the maximized log-likelihood in generalized maximum-entropy inference always equals minus the Shannon entropy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The axiom that uniform distributions carry no information about entropic parameters selects Rényi entropy within the considered families and implies that, in any generalized maximum-entropy framework consistent with the axiom, the maximized log-likelihood is invariably equal to minus the Shannon entropy.
What carries the argument
The axiom that a uniform distribution is completely uninformative for the purpose of identifying entropic parameters.
If this is right
- Only Rényi entropy survives the axiom inside the Uffink-Jizba-Korbel and Hanel-Thurner families.
- The entropic parameter can be estimated purely from informative data without external system knowledge.
- Generalized maximum-entropy inference remains consistent with the classical maximum-likelihood principle.
- The value of the maximized log-likelihood is always minus the Shannon entropy regardless of which generalized entropy is maximized.
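The first consequence can be illustrated numerically. The sketch below is not taken from the paper: it assumes the standard textbook definitions of Rényi and Tsallis entropy, and the function names and the choice of eight states are hypothetical. It evaluates both entropies on a uniform distribution for several values of q.

```python
import math

def renyi_entropy(p, q):
    """Renyi entropy ln(sum_i p_i^q) / (1 - q), with the Shannon limit at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return math.log(sum(x ** q for x in p if x > 0)) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy (sum_i p_i^q - 1) / (1 - q), with the Shannon limit at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** q for x in p if x > 0) - 1.0) / (1.0 - q)

# Hypothetical example: uniform distribution over 8 states.
omega = 8
uniform = [1.0 / omega] * omega

for q in (0.5, 1.0, 2.0):
    # Renyi stays at ln(omega) for every q; Tsallis changes with q.
    print(q, renyi_entropy(uniform, q), tsallis_entropy(uniform, q))
```

On uniform input the Rényi value does not move with q, so a uniform distribution gives no handle for preferring one q over another, whereas the Tsallis value does change with q.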
Where Pith is reading between the lines
- The separation between entropy choice and parameter estimation could be tested on empirical distributions from systems suspected to be non-ergodic.
- The same axiom might be applied to other entropy families not examined in the paper to check whether additional candidates survive.
- Numerical checks on synthetic data generated from known Rényi distributions would confirm that the recovered log-likelihood matches minus Shannon entropy.
Load-bearing premise
The uniform distribution must be treated as carrying no information whatsoever about entropic parameters.
What would settle it
A dataset drawn from a uniform distribution on which the generalized maximum-likelihood procedure returns a non-default value for an entropic parameter would falsify the central claim.
Original abstract
Shannon entropy, a cornerstone of information theory, statistical physics and inference methods, is uniquely identified by the Shannon-Khinchin or Shore-Johnson axioms. Generalizations of Shannon entropy, motivated by the study of non-extensive or non-ergodic systems, relax some of these axioms and lead to entropy families indexed by certain 'entropic' parameters. In general, the selection of these parameters requires pre-knowledge of the system or encounters inconsistencies. Here we introduce a simple axiom for any entropy family: namely, that no entropic parameter can be inferred from a completely uninformative (uniform) probability distribution. When applied to the Uffink-Jizba-Korbel and Hanel-Thurner entropies, the axiom selects only Rényi entropy as viable. It also extends consistency with the Maximum Likelihood principle, which can then be generalized to estimate the entropic parameter purely from data, as we confirm numerically. Remarkably, in a generalized maximum-entropy framework the axiom implies that the maximized log-likelihood always equals minus Shannon entropy, even if the inferred probability distribution maximizes a generalized entropy and not Shannon's, solving a series of problems encountered in previous approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a new axiom for generalized entropy families: no entropic parameter can be inferred from a completely uninformative (uniform) probability distribution. When imposed on the Uffink-Jizba-Korbel and Hanel-Thurner families, the axiom selects only Rényi entropy. It further claims that this axiom extends consistency with the maximum-likelihood principle, permits data-driven estimation of the entropic parameter, and implies that the maximized log-likelihood in a generalized maximum-entropy setting always equals minus the Shannon entropy, regardless of which entropy is maximized.
Significance. If the derivation holds, the work supplies a data-driven selection criterion among generalized entropies and removes a known inconsistency between generalized maxent and likelihood-based inference. The numerical verification of parameter recovery and the explicit likelihood-Shannon equality constitute concrete, falsifiable contributions.
major comments (2)
- [§3.1] §3.1, definition of the inference map: the axiom is load-bearing for both the selection of Rényi entropy and the subsequent log-likelihood identity, yet the precise procedure by which an entropic parameter is 'inferred' from a uniform distribution is stated only informally. Different choices of estimator or likelihood could permit parameter recovery from uniform data, altering which members of the families survive the axiom.
- [§4.2, Eq. (22)] §4.2, Eq. (22): the claim that maximized log-likelihood equals -S_Shannon is derived under the new axiom, but the algebraic steps showing independence from the specific generalized entropy (beyond Rényi) are not fully expanded; an explicit intermediate identity linking the maxent Lagrange multiplier to the Shannon functional would strengthen the result.
minor comments (2)
- Notation for the entropic parameter q (or α) is introduced inconsistently across the UJK and HT families; a single table comparing the parameter ranges and the action of the axiom would improve readability.
- Figure 2 caption should state the sample size and number of Monte-Carlo realizations used for the numerical confirmation of parameter recovery.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will revise the manuscript to improve clarity where needed.
Point-by-point responses
-
Referee: [§3.1] §3.1, definition of the inference map: the axiom is load-bearing for both the selection of Rényi entropy and the subsequent log-likelihood identity, yet the precise procedure by which an entropic parameter is 'inferred' from a uniform distribution is stated only informally. Different choices of estimator or likelihood could permit parameter recovery from uniform data, altering which members of the families survive the axiom.
Authors: We agree that a more formal definition of the inference map strengthens the presentation. In the revised manuscript we will explicitly define the inference procedure as maximum-likelihood estimation of the entropic parameter from the observed distribution, and we will prove that, under this standard estimator, the uniform distribution yields a flat likelihood, independent of the parameter, only for Rényi entropy among the members of the two families. We maintain that the axiom is intended to apply to any estimator capable of distinguishing parameters; the explicit MLE choice removes ambiguity without changing the selection result. revision: yes
-
Referee: [§4.2, Eq. (22)] §4.2, Eq. (22): the claim that maximized log-likelihood equals -S_Shannon is derived under the new axiom, but the algebraic steps showing independence from the specific generalized entropy (beyond Rényi) are not fully expanded; an explicit intermediate identity linking the maxent Lagrange multiplier to the Shannon functional would strengthen the result.
Authors: The referee is correct that the derivation of Eq. (22) can be made more transparent. We will expand the algebra in the revision by inserting the intermediate identity that equates the Lagrange multiplier of the generalized maxent problem (under the new axiom) directly to the negative Shannon entropy of the maximizing distribution. This step shows explicitly why the maximized log-likelihood equals -S_Shannon independently of which generalized entropy is used, provided the axiom holds. revision: yes
Circularity Check
New axiom on uniform distributions selects Rényi entropy and implies likelihood equality as direct consequence, with no reduction to fitted inputs or self-referential definitions
full rationale
The paper introduces its central axiom (no entropic parameter inferable from uniform distributions) as an independent premise, then applies it to the externally cited Uffink-Jizba-Korbel and Hanel-Thurner families to select Rényi entropy. The further claim that maximized log-likelihood equals minus Shannon entropy is presented as a logical implication of this axiom within the generalized maxent setting, not as a redefinition or statistical fit of the same quantity. No equations reduce a derived result to a parameter estimated from the target data, and no load-bearing step relies on self-citation chains. The derivation remains self-contained against the stated axiom and external families.
Axiom & Free-Parameter Ledger
axioms (1)
- ad hoc to paper: No entropic parameter can be inferred from a completely uninformative (uniform) probability distribution
Reference graph
Works this paper leans on
-
[1]
coincides with the physical entropy derived by Gibbs [3], which in turn generalizes Boltzmann entropy in Eq. (
-
[2]
[2]. This equivalence is not coincidental and is rooted in statistical inference, as Jaynes showed with the introduction of the Maximum Entropy Principle (MEP) [5]. The MEP states that, given only a set I of pieces of empirical information about a system (in the physical situation, this typically means the knowledge of a few, macroscopic conserved quantit...
-
[3]
as any other parametric distribution and select the value $\theta_1^* = \arg\max_\theta \ell_1(\theta)$, $\ell_1(\theta) \equiv \ln p_1(G^*, \theta)$. (10) As a first result, it is easy to show that the value $\theta_1^*$ defined by Eq. (10) coincides with the value $\theta_1$ defined by Eq. (9) [14, 15], i.e. $\theta_1^* \equiv \theta_1$ (in our notation, the asterisk next to a parameter will always denote the ML value of that paramet...
-
[4]
in the case of soft constraints. This relationship is very important, because the maximized likelihood is at the basis of model selection criteria [16, 17]: if alternative models (i.e. alternative parametric probability distributions) are compared against the same empirical data, the model to be preferred (assuming all models have the same complexit...
-
[5]
ensures that the ranking of models based on ML is the same as the ranking based on minus their entropy: the least uncertain (i.e. most informative) model has to be preferred. For models with different numbers of parameters and/or functional forms, the ranking based on likelihood/entropy has to be revised by adding a term controlling for the variabl...
-
[6]
is inserted into Eq. (5), we get $L_1[P_1(\theta_1^*)] = S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)$, (13) so that the Lagrangian, evaluated at $P_1(\theta_1^*)$, coincides with minus the maximized log-likelihood, and can therefore be used to rank alternative models as well. All the above results indicate that the MEP can be used as a model selection criterion, exactly as the ML principle, ...
-
[7]
with $\theta_1^* = \arg\max_\theta \ell_1(\theta)$, $\ell_1(\theta) \equiv \frac{1}{M}\sum_{m=1}^{M} \ln p_1(G_m^*, \theta)$. (14) It is easy to show that the condition $\partial \ell_1(\theta)/\partial\theta\,|_{\theta_1^*} = 0$ identifying $\theta_1^*$ leads to the well-known result $\langle C \rangle = \frac{1}{M}\sum_{m=1}^{M} C_m^*$ (15) where the (arithmetic) sample average of the M observations has emerged. So, in order to find the ML parameter value $\theta_1^*$, one should replace Eq. (
-
[8]
with Eq. (15), or equivalently redefine $C^*$ in Eq. (4) as the sample average of $\{C_m^*\}_{m=1}^{M}$. In plain words, the sample average is ‘produced’ by the ML principle. On the contrary, within the MEP construction, there is no way of ‘telling’ Eqs. (5) and (9) what, in case of M observations, the meaning and definition of $C^*$ should be. So in this case the ML ...
-
[9]
is still equal to minus the entropy: $S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)$, (16) generalizing Eq. (12). Note that there is no microcanonical counterpart of Eq. (16), since Eq. (3) cannot be generalized to the case $M > 1$, unless all the M values $\{C_m^*\}_{m=1}^{M}$ are identical. Indeed the microcanonical ensemble cannot be constructed, because by definition it cannot acco...
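The identity in Eq. (16) follows in a few lines once the functional form of the Shannon maximum-entropy distribution is written out. The derivation below is a sketch under an assumption not visible in this excerpt: that $p_1$ takes the usual exponential (canonical) form with partition function $Z(\theta)$. The symbol $\bar{C}$ for the sample mean is introduced here for convenience.

```latex
% Assumption: canonical (q = 1) maximum-entropy form, not shown in the excerpt
p_1(G,\theta) = \frac{e^{-\theta\cdot C(G)}}{Z(\theta)}, \qquad
Z(\theta) = \sum_{i=1}^{\Omega} e^{-\theta\cdot C(G_i)}

% Average log-likelihood of the M observations, with sample mean \bar{C}
\ell_1(\theta) = \frac{1}{M}\sum_{m=1}^{M}\ln p_1(G_m^*,\theta)
               = -\theta\cdot\bar{C} - \ln Z(\theta), \qquad
\bar{C} \equiv \frac{1}{M}\sum_{m=1}^{M} C_m^*

% Shannon entropy of the same distribution
S_1[P_1(\theta)] = -\sum_{i=1}^{\Omega} p_1(G_i,\theta)\ln p_1(G_i,\theta)
                 = \theta\cdot\langle C\rangle + \ln Z(\theta)

% The sum vanishes at the ML point, where \langle C\rangle = \bar{C} by Eq. (15)
S_1[P_1(\theta)] + \ell_1(\theta) = \theta\cdot\bigl(\langle C\rangle - \bar{C}\bigr)
\quad\Longrightarrow\quad
S_1[P_1(\theta_1^*)] = -\ell_1(\theta_1^*)
```

The step that does the work is the ML condition of Eq. (15): it is exactly the statement that the model average matches the sample average, which cancels the only term separating entropy from minus log-likelihood.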
-
[10]
(note instead that the subscript in the uniform distribution $P_0$ used in Sec. II B to describe the microcanonical distribution under hard constraints has nothing to do with the case $q = 0$, which is inadmissible). Notably, Jizba and Korbel [20, 21] showed that an entropy of the type $f(U_q[P])$ can also be obtained from the SK axioms, provided that SK4 is rel...
-
[11]
and (21), implying that, in absence of such knowledge, the entropic parameters cannot be consistently derived purely from data as the other parameters. Another approach does allow for the entropic parameters to be inferred from data, again invoking some form of maximization of the generalized entropy [26, 27]. However, as we show below, this requiremen...
-
[12]
is maximized by $P_u$. This requirement comes from a ‘horizontal’ perspective, in the sense that it holds for each q-entropy in the family. Our axiom, on the other hand, provides a ‘vertical’ perspective: among all the q-entropies, none of them has to be preferred when applied to $P_u$. In other words, the axiom ensures the uninformativeness role of the uni...
-
[13]
and (24) we obtain $f(x) = \ln x$ for all q, i.e. the only viable UJK entropy is Rényi entropy [28] $S_q[P] \equiv S_q^{(\ln)}[P] = \ln U_q[P] = \frac{1}{1-q} \ln \sum_{i=1}^{\Omega} p^q(G_i)$, (25) where, since the entropy above is the only ‘surviving one’ in the family $S_q^{(f)}[P]$, we have removed the superscript from the resulting $S_q^{(\ln)}[P]$. From Eq. (
-
[14]
we can confirm that this entropy reduces to Shannon entropy in the limit $q \to 1$, a well-known result for Rényi entropy. This entropy is such that, on the uniform distribution $P_u$, $S_q[P_u] = \ln \Omega$, (26) which does not depend on q, as demanded by our axiom. Therefore the only viable UJK entropy is Rényi entropy. In general, other UJK entropies do not resp...
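Both statements in this excerpt can be checked by hand from the Rényi form $\frac{1}{1-q}\ln\sum_i p^q(G_i)$. The short derivation below is a standard calculation written out for convenience, not a reproduction of the paper's own steps.

```latex
% Uniform distribution: p(G_i) = 1/\Omega for every i
S_q[P_u] = \frac{1}{1-q}\,\ln\sum_{i=1}^{\Omega}\Omega^{-q}
         = \frac{1}{1-q}\,\ln\Omega^{\,1-q}
         = \ln\Omega

% Shannon limit: numerator and denominator both vanish at q = 1, so apply L'Hopital's rule
\lim_{q\to 1} S_q[P]
  = \lim_{q\to 1}\frac{\frac{d}{dq}\ln\sum_{i} p^q(G_i)}{\frac{d}{dq}(1-q)}
  = -\,\frac{\sum_{i} p(G_i)\ln p(G_i)}{\sum_{i} p(G_i)}
  = -\sum_{i=1}^{\Omega} p(G_i)\ln p(G_i)
```

The factor $1-q$ cancels exactly in the uniform case, which is why the result carries no dependence on q.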
-
[15]
or (3). Note that Eq. (3) applies in the microcanonical case, but a similar non-extensive scaling of the entropy would be exhibited in the canonical case as well. An important example in this respect is provided by random graphs: the number of all binary graphs on n vertices is $\Omega = 2^{\binom{n}{2}}$, so it is super-exponential [8, 30]. Even when subject to vario...
-
[16]
[8], consistently with the fact that the only admissible trace-form HT entropy according to our axiom is Shannon entropy, as we have shown above. To obtain a truly generalized maximum-entropy probability, one should therefore consider non-trace-form entropies. In particular, considering the Rényi entropy $S_q[P]$ which our axiom selects from both the UJK ...
-
[17]
(note that $\langle C \rangle_1 = \langle C \rangle$). The quantity $\langle C \rangle_q$ is sometimes called (normalized) q-mean, and it can be regarded as a mean with respect to the so-called escort (or zooming) probability distribution $\tilde{p}(G_i) = p^q(G_i) / \sum_j p^q(G_j)$ [7, 31]. This q-mean has been introduced to extend important properties and relations from the classical (i.e. Shannonian) statistical mech...
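The escort distribution and the q-mean described here are simple to compute. The following is a minimal sketch, assuming a finite state space and a scalar observable; the function names and the three-state example are hypothetical and chosen only for illustration.

```python
def escort(p, q):
    """Escort distribution: p_i^q / sum_j p_j^q."""
    weights = [x ** q for x in p]
    total = sum(weights)
    return [w / total for w in weights]

def q_mean(values, p, q):
    """Normalized q-mean: average of `values` under the escort distribution of p."""
    return sum(v * e for v, e in zip(values, escort(p, q)))

# Hypothetical three-state example with observable values 0, 1, 2.
p = [0.5, 0.25, 0.25]
values = [0.0, 1.0, 2.0]

print(q_mean(values, p, 1.0))  # q = 1 reduces to the ordinary mean
print(q_mean(values, p, 2.0))  # q = 2 puts more weight on the most probable state
```

For q greater than 1 the escort distribution concentrates on high-probability states, and for q less than 1 it lifts low-probability ones, which is why the q-mean generally differs from the arithmetic mean.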
-
[18]
and then summing over i. We then get $\alpha_q = \frac{q}{1-q}$ $(q \neq 1)$, (35) which is the counterpart of Eq. (8). Substituting $\alpha_q$ in (33) and singling out $p_q(G_i)$ yields $p_q(G_i, \theta) = \frac{[1 - (1-q)\,\theta \cdot (C(G_i) - \langle C \rangle_q)]_+^{1/(1-q)}}{[\sum_{j=1}^{\Omega} p_q^q(G_j, \theta)]^{1/(1-q)}}$ (36) where we have used the notation $[x]_+^a \equiv 0$ if $x < 0$, while $[x]_+^a \equiv x^a$ otherwise [7]. Note that the denominato...
-
[19]
equals the Uffink functional $U_q[P_q(\theta)]$ and must also equal the generalized partition function $W_q(\theta) \equiv \sum_{i=1}^{\Omega} [1 - (1-q)\,\theta \cdot (C(G_i) - \langle C \rangle_q)]_+^{1/(1-q)}$ (37) since $p_q(G_i, \theta)$ is already normalized via the condition in Eq. (35). In other words, $W_q(\theta) = [\sum_{i=1}^{\Omega} p_q^q(G_i, \theta)]^{1/(1-q)} = U_q[P_q(\theta)]$. (38) Finally, the maximum-entropy probability equals $p_q(G_i, \theta) =$ [1...
-
[20]
to the case $q \neq 1$, generalizing the result that, in a model selection framework, different models can be ranked according to their maximized likelihood or, equivalently, to their realized Rényi entropy. Notably, other entropies of the UJK family, including Tsallis entropy, do not manifest this property. Also the relationship in Eq. (
-
[21]
generalizes as follows: $L_q[P_q(\psi_q^*)] = S_q[P_q(\psi_q^*)] = -\ell_q(\psi_q^*)$, (56) relating the value of the Lagrangian attained by $P_q(\psi_q^*)$ to the maximized log-likelihood. Therefore, up to this point, it seems that Rényi entropy retains all the desirable properties of Shannon entropies. We now consider the case of $M > 1$ i.i.d. realizations $\{G_m^*\}_{m=1}^{M}$ of...
-
[22]
converges if q is such that the distribution is normalizable (which is a basic requirement for this procedure to be consistent [7]). The ML estimator determined by Eq. (
-
[23]
identifies the distribution’s parameters, irrespective of the convergence of any moment. An important consequence of the fact that $\langle C \rangle_q$ is no longer equal to the arithmetic mean of the M observations is that in general, for $q \neq 1$ and $M > 1$, $S_q[P_q(\psi_q^*)] \neq -\ell_q(\psi_q^*)$, (60) thus failing to generalize Eq. (16) to the case $q \neq 1$ and Eq. (55) to the case M...
-
[24]
(with q replaced by $q^*$) plus the additional condition $\sum_{i=1}^{\Omega} p_{q^*}(G_i, \psi^*_{q^*}) \ln[1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_i)] = \frac{1}{M}\sum_{m=1}^{M} \ln[1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_m^*)]$. (65) Recalling from Eq. (
-
[25]
that $1 - (1-q^*)\,\psi^*_{q^*} \cdot C(G_i) = [p_{q^*}(G_i, \psi^*_{q^*})\, Z_{q^*}(\psi^*_{q^*})]^{1-q^*}$ (66) we obtain the condition $\sum_{i=1}^{\Omega} p_{q^*}(G_i, \psi^*_{q^*}) \ln p_{q^*}(G_i, \psi^*_{q^*}) = \frac{1}{M}\sum_{m=1}^{M} \ln p_{q^*}(G_m^*, \psi^*_{q^*})$. (67) In other words, the additional ML condition determining $q^*$ requires that the maximized log-likelihood equals minus Shannon entropy, i.e. $S_1[P_{q^*}(\psi^*_{q^*})] = -\ell_{q^*}(\psi^*_{q^*})$, (68) r...
-
[26]
is not replaced, but rather accompanied by Eq. (61). Remarkably, the connection between Shannon entropy and log-likelihood at the specific parameter value $(\psi^*_{q^*}, q^*)$ remains a general result, even for $q \neq 1$ and $M > 1$. This might look quite surprising, because, for $q \neq 1$, the log-likelihood is based on the q-exponential distribution that maximizes ...
-
[27]
should be put in relation with our initial discussion of the axiom SJ3 about system independence. Recall that assuming that the M values $\{C_m^*\}_{m=1}^{M}$ come from independent observations is equivalent to assuming that there are M identical and independent copies of the same system, each copy being observed exactly once. Under this assumption of indepen...
-
[28]
and combine it with Eq. (68) to obtain $S_{q^*}[P_{q^*}(\psi^*_{q^*})] = S_1[P_{q^*}(\psi^*_{q^*})]$ (69) showing that in this particular case the maximum-entropy probability distribution returns coinciding values of Shannon and Rényi entropy, even if it maximizes the latter but not the former. This result does not hold in general for $M > 1$. The remarkable result in Eq. (
-
[29]
has an important consequence for model selection. In particular, in order to determine both $q^*$ and $\psi^*_{q^*}$, one can consider a range of values for q and, for each value in the range, compute $\psi_q^*$ according to Eq. (59). This results, for each value of q, in a log-likelihood $\ell_q(\psi_q^*)$ that is only partially maximized, in the sense that the maximization ha...
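The scan over q described in this excerpt can be sketched as follows. This is an illustration under several assumptions, none of which come from the paper: a scalar constraint C(G_i) = i on ten states, made-up observed counts, the normalized q-exponential form suggested by Eq. (66), and a zooming grid search standing in for Eq. (59), which is not shown here. All function names are hypothetical.

```python
import math

def q_exp_probs(c_values, q, psi):
    """p_i = [1 - (1-q) psi c_i]_+^{1/(1-q)} / Z, with the exponential limit at q = 1."""
    if abs(q - 1.0) < 1e-12:
        w = [math.exp(-psi * c) for c in c_values]
    else:
        base = [1.0 - (1.0 - q) * psi * c for c in c_values]
        w = [b ** (1.0 / (1.0 - q)) if b > 0 else 0.0 for b in base]
    z = sum(w)
    return [x / z for x in w]

def log_likelihood(freqs, probs):
    """Average log-likelihood sum_i f_i ln p_i; -inf if an observed state has p_i = 0."""
    total = 0.0
    for f, p in zip(freqs, probs):
        if f > 0:
            if p <= 0:
                return float("-inf")
            total += f * math.log(p)
    return total

def shannon(probs):
    """Shannon entropy -sum_i p_i ln p_i."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def fit_psi(c_values, freqs, q, lo=1e-6, hi=5.0, rounds=12, points=41):
    """Zooming grid search for the psi that maximizes the log-likelihood at fixed q."""
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + k * step for k in range(points)]
        best = max(grid, key=lambda s: log_likelihood(freqs, q_exp_probs(c_values, q, s)))
        lo, hi = max(best - step, 1e-12), best + step
    return best

# Hypothetical data: counts of each state in M = 100 observations.
c_values = list(range(10))
counts = [30, 22, 15, 11, 8, 5, 4, 3, 1, 1]
m = sum(counts)
freqs = [n / m for n in counts]

def profile(q):
    """Return (psi*_q, partially maximized log-likelihood, Shannon entropy of the fit)."""
    psi = fit_psi(c_values, freqs, q)
    probs = q_exp_probs(c_values, q, psi)
    return psi, log_likelihood(freqs, probs), shannon(probs)

for q in (0.8, 0.9, 1.0, 1.1, 1.2):
    psi, ell, s1 = profile(q)
    print(f"q={q:.1f}  psi*={psi:.4f}  loglik={ell:.4f}  -Shannon={-s1:.4f}")
```

At q = 1 the two printed columns should agree, as the Shannon identity of Eq. (16) requires. For other values of q they generally differ, and a second crossing, if present, would mark the jointly maximized value of q in the sense of Eq. (68). The grid search is a convenience for the sketch; a gradient-based optimizer would do the same job.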
-
[30]
is realized. The true values of the parameters and their inferred ML estimates $(q^*, \psi^*_{q^*})$ are presented in Table I. Since the left plot corresponds to $q_{\mathrm{true}} = 1$, it is a standard exponential distribution. In such a case, the two curves intersect only for $q = 1$. By contrast, the other two cases correspond to $q_{\mathrm{true}} \neq 1$ and the two curves intersect in tw...
-
[31]
R. Clausius, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 12, 241 (1856)
-
[32]
L. Boltzmann, Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung, respective den Sätzen über das Wärmegleichgewicht (Kk Hof-und Staatsdruckerei, 1877)
-
[33]
J. W. Gibbs, Elementary principles in statistical mechanics (Courier Corporation, 2014)
-
[34]
C. E. Shannon, Bell system technical journal 27, 379 (1948)
-
[35]
E. T. Jaynes, Physical review 106, 620 (1957)
-
[36]
K. P. Murphy, Probabilistic machine learning: an introduction (MIT press, 2022)
-
[37]
C. Tsallis, Introduction to nonextensive statistical mechanics: approaching a complex world (Springer Science & Business Media, 2009)
-
[38]
S. Thurner, R. Hanel, and P. Klimek, Introduction to the theory of complex systems (Oxford University Press, 2018)
-
[39]
J. M. Amigó, S. G. Balogh, and S. Hernández, Entropy 20, 813 (2018)
-
[40]
A. M. Lopes and J. A. T. Machado, Entropy 22, 1374 (2020)
- [41]
-
[42]
J. Karmeshu, Entropy measures, maximum entropy principle and emerging applications, Vol. 119 (Springer Science & Business Media, 2003)
-
[43]
T. Squartini and D. Garlaschelli, Maximum-Entropy Networks: Pattern Detection, Network Reconstruction and Graph Combinatorics (Springer, 2017)
-
[45]
D. Garlaschelli and M. I. Loffredo, Physical Review E 78, 015101 (2008)
-
[46]
K. P. Burnham and D. R. Anderson, Model Selection and Multimodel Inference (Springer New York, NY, 2002)
-
[47]
P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in minimum description length: Theory and applications (MIT press, 2005)
-
[48]
J. Shore and R. Johnson, IEEE Transactions on information theory 26, 26 (1980)
-
[49]
J. Uffink, Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics 26, 223 (1995)
- [50]
- [51]
-
[52]
A. N. Kolmogorov and G. Castelnuovo, Sur la notion de la moyenne (G. Bardi, tip. della R. Accad. dei Lincei, 1930)
-
[53]
M. Nagumo, in Japanese journal of mathematics: transactions and abstracts, Vol. 7 (The Mathematical Society of Japan, 1930) pp. 71–79
- [54]
-
[55]
S. G. Balogh, G. Palla, P. Pollner, and D. Czégel, Scientific reports 10, 1 (2020)
-
[56]
A. Plastino, H. Miller, and A. Plastino, Continuum Mechanics and Thermodynamics 16, 269 (2004)
-
[57]
A. Bashkirov, Physical review letters 93, 130601 (2004)
-
[58]
A. Rényi, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (University of California Press, 1961) pp. 547–561
-
[59]
C. Tsallis, Journal of statistical physics 52, 479 (1988)
- [60]
-
[61]
C. Beck and F. Schlögl, Thermodynamics of chaotic systems: an introduction, 4 (Cambridge University Press, 1995)