arXiv: 2606.25042 · v1 · pith:N5637P2Snew · submitted 2026-06-23 · 💻 cs.IT · math.IT · math.PR · math.ST · stat.ML · stat.TH

Information from coincidences

Pith reviewed 2026-06-25 21:57 UTC · model grok-4.3

classification 💻 cs.IT · math.IT · math.PR · math.ST · stat.ML · stat.TH
keywords information theory · variational methods · coincidence identity · Renyi entropy · PAC-Bayes · Chernoff information · Sanov theorem · hypothesis testing

The pith

One algebraic identity shows the log of a mixed prior count equals a Boltzmann weight, normalizer, maximum-entropy value, and KL-barycenter optimum at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a single algebraic mixed coincidence identity for any family of priors and real exponents. The log of the expectation of their powered product is shown to equal four different classical objects used in variational information theory. If correct, this identity supplies a common derivation for results on empirical distribution concentration, hypothesis testing exponents, change-of-measure inequalities, and rare coincidence thresholds, while extending the usual one- and two-prior Renyi formulas to any number of priors and to unnormalized or continuum cases.
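The one- and two-prior Renyi formulas mentioned above can be checked numerically. The sketch below assumes a finite alphabet and a counting reference measure, and uses the classical definitions H_α(p) = log(Σ_x p(x)^α)/(1−α) and D_α(p‖q) = log(Σ_x p(x)^α q(x)^{1−α})/(α−1). The function names and example distributions are illustrative and are not taken from the paper.

```python
import math

def log_mixed_count(priors, alphas, nu=None):
    """log sum_x nu(x) * prod_i priors[i][x] ** alphas[i] on a finite alphabet.

    nu defaults to counting measure (an assumption made for this sketch).
    """
    size = len(priors[0])
    if nu is None:
        nu = [1.0] * size
    total = sum(
        nu[x] * math.prod(p[x] ** a for p, a in zip(priors, alphas))
        for x in range(size)
    )
    return math.log(total)

def renyi_entropy(p, alpha):
    """Classical Renyi entropy of order alpha (alpha != 1), in nats."""
    return math.log(sum(px ** alpha for px in p)) / (1.0 - alpha)

def renyi_divergence(p, q, alpha):
    """Classical Renyi divergence of order alpha (alpha != 1), in nats."""
    s = sum(px ** alpha * qx ** (1.0 - alpha) for px, qx in zip(p, q))
    return math.log(s) / (alpha - 1.0)

# Hypothetical example distributions, strictly positive everywhere.
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
alpha = 2.0

# W = 1 with exponent alpha: log mixed count = (1 - alpha) * H_alpha(p).
one_prior = log_mixed_count([p], [alpha])
# W = 2 with exponents (alpha, 1 - alpha): log mixed count = (alpha - 1) * D_alpha(p || q).
two_prior = log_mixed_count([p, q], [alpha, 1.0 - alpha])
```

In this reading the W-prior case simply lets the list of priors and exponents grow beyond two entries; whether the paper normalizes the exponents the same way is not stated in the text above.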

Core claim

The central claim is that log E_{x∼ν}[∏_{i=1}^W π_i^{α_i}(x)] is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. Specializing the same equality recovers Sanov-type decompositions, Chernoff information, Donsker-Varadhan and PAC-Bayes inequalities, Erdos-Renyi run lengths, rate-distortion thresholds, and birthday problems. The identity generalizes the classical Renyi variational formulas to a W-prior simplex and holds for unnormalized and continuum-indexed priors.
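For orientation, one standard route from a log-expectation of this form to a variational optimum is the Gibbs (Donsker-Varadhan) variational principle. The display below is a sketch under the assumptions that ν is a probability measure on a finite set and that every π_i is strictly positive on the support of ν. The symbols f, ρ, and ρ⋆ are introduced here for illustration; the paper's own statement and notation may differ.

```latex
% Sketch only: f, \rho, \rho^\star are illustrative symbols, not the paper's notation.
% Assumes \nu is a probability measure on a finite set and \pi_i > 0 on supp(\nu).
\[
\log \mathbb{E}_{x\sim\nu}\Big[\prod_{i=1}^{W}\pi_i^{\alpha_i}(x)\Big]
= \log \mathbb{E}_{x\sim\nu}\big[e^{f(x)}\big]
= \max_{\rho \ll \nu}\Big\{\sum_{i=1}^{W}\alpha_i\,\mathbb{E}_{x\sim\rho}[\log \pi_i(x)]
  - \mathrm{KL}(\rho\,\|\,\nu)\Big\},
\qquad f(x) := \sum_{i=1}^{W}\alpha_i \log \pi_i(x),
\]
% The maximum is attained at the exponentially tilted (Boltzmann) distribution:
\[
\rho^\star(x)
= \frac{\nu(x)\prod_{i=1}^{W}\pi_i^{\alpha_i}(x)}
       {\mathbb{E}_{x'\sim\nu}\big[\prod_{i=1}^{W}\pi_i^{\alpha_i}(x')\big]}.
\]
```

The denominator of ρ⋆ is the mixed count itself, which is why the same quantity can be read both as an exponential-family normalizer and as the value of a variational problem.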

What carries the argument

The mixed count E_{x∼ν}[∏_{i=1}^W π_i^{α_i}(x)] whose logarithm is equated to the four listed variational objects.

If this is right

  • Sanov decompositions and Gibbs conditioning follow as direct special cases.
  • Chernoff information and its multi-way version give hypothesis-testing error exponents.
  • Donsker-Varadhan and PAC-Bayes change-of-measure inequalities are recovered uniformly.
  • An exact multi-prior PAC-Bayes penalty subtracts an explicit coincidence bonus from the usual term.
  • The asymptotic MAP error exponent for W-ary testing appears as an edge-restricted simplex optimum.
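The Chernoff information item in the list above can be made concrete in the two-prior case. The classical definition C(p, q) = −min over λ in [0, 1] of log Σ_x p(x)^λ q(x)^{1−λ} is a log mixed count with W = 2, exponents (λ, 1−λ), and counting reference measure. The sketch below is a minimal illustration under those assumptions; the grid search, function names, and example distributions are choices made here, not the paper's method.

```python
import math

def log_affinity(p, q, lam):
    """log sum_x p(x)**lam * q(x)**(1 - lam): the two-prior log mixed count."""
    return math.log(sum(px ** lam * qx ** (1.0 - lam) for px, qx in zip(p, q)))

def chernoff_information(p, q, grid=2001):
    """Approximate C(p, q) by minimizing the log affinity over a uniform grid on [0, 1]."""
    lams = [k / (grid - 1) for k in range(grid)]
    return -min(log_affinity(p, q, lam) for lam in lams)

# Hypothetical strictly positive distributions on a 3-letter alphabet.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

c_pq = chernoff_information(p, q)
c_qp = chernoff_information(q, p)
c_pp = chernoff_information(p, p)
```

A grid search is used only for transparency; because the log affinity is convex in λ, any one-dimensional convex minimizer would serve equally well.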

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same identity may supply exact finite-sample versions of several bounds that are usually stated only asymptotically.
  • Contrastive decoding in language models and sliding-window separation of genomic priors illustrate how the calculus applies at scale to sequential data.
  • Other variational problems outside information theory that involve products of measures could admit analogous unifications.

Load-bearing premise

The algebraic identity holds exactly for arbitrary real exponents and for unnormalized or continuum-indexed priors.

What would settle it

A concrete counterexample computation, for chosen priors, exponents, and reference measure, in which the log mixed count fails to equal the claimed KL-barycenter optimum.
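A check of this kind is straightforward to script on a finite alphabet. The sketch below tests the Gibbs (Donsker-Varadhan) form of the equality, log E_ν[∏ π_i^{α_i}] = max over ρ of {Σ_i α_i E_ρ[log π_i] − KL(ρ‖ν)}, which is a standard fact when ν is a probability vector and all priors are strictly positive. It is not a test of the paper's exact KL-barycenter statement, and every name and number in it is a hypothetical example.

```python
import math
import random

def tilt(x, priors, alphas):
    """f(x) = sum_i alpha_i * log pi_i(x)."""
    return sum(a * math.log(p[x]) for p, a in zip(priors, alphas))

def log_mixed_count(nu, priors, alphas):
    """log E_{x~nu}[prod_i pi_i(x)**alpha_i]."""
    return math.log(sum(n * math.exp(tilt(x, priors, alphas)) for x, n in enumerate(nu)))

def gibbs_objective(rho, nu, priors, alphas):
    """sum_i alpha_i * E_rho[log pi_i] - KL(rho || nu)."""
    return sum(
        r * (tilt(x, priors, alphas) - math.log(r / nu[x]))
        for x, r in enumerate(rho)
        if r > 0.0
    )

def tilted_distribution(nu, priors, alphas):
    """rho*(x) proportional to nu(x) * prod_i pi_i(x)**alpha_i."""
    weights = [n * math.exp(tilt(x, priors, alphas)) for x, n in enumerate(nu)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical inputs: uniform reference measure, three positive priors,
# and mixed-sign exponents.
nu = [0.25, 0.25, 0.25, 0.25]
priors = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.4, 0.1]]
alphas = [2.0, 0.5, -1.0]

lhs = log_mixed_count(nu, priors, alphas)
rhs = gibbs_objective(tilted_distribution(nu, priors, alphas), nu, priors, alphas)

# Random search for a counterexample: no rho should beat the tilted optimum.
rng = random.Random(0)
worst_excess = -math.inf
for _ in range(1000):
    raw = [rng.random() for _ in nu]
    rho = [r / sum(raw) for r in raw]
    worst_excess = max(worst_excess, gibbs_objective(rho, nu, priors, alphas) - lhs)
```

A positive `worst_excess` beyond floating-point tolerance, or a gap between `lhs` and `rhs`, would be the kind of counterexample described above for this particular form of the identity.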

Figures

Figures reproduced from arXiv: 2606.25042 by Akshay Balsubramani.

Figure 1
Figure 1: Phase-transition heatmap on three benchmarks at W = 3, α = (2, 2, 2). (a) Synthetic three-prior Zipf bank (|V| = 10^4, rank-1.2, σ = 0.1 logit noise). (b) gpt2 correlated Paris paraphrases (Paris_Corr_3; |V| = 50,257). (c) gpt2 diverse country-history templates (Paris_Div_3; |V| = 50,257). Color encodes empirical Pr[m-coincidence]; the dashed green curve is the theoretical threshold m = Ψ(α) log n. proxy c…
Figure 2
Figure 2: Two complementary instruments for the pooled partition function Z(α), the same target in both panels. (a) Certified two-sided bracket (this work) on the realized gpt2 priors (W = 3, α = (2, 2, 2); head-set S the union of per-prior tops at size K). The certified relative width (U − L)/L descends monotonically to machine precision, below 10^{−6} by K ≈ 20 on both Paris_Corr_3 (correlated) and Paris_Div_3 (diverse, …
Figure 3
Figure 3: The mixed coincidence partition function recurs along trajectories. (a) Language: gpt2 autoregressive continuation under W = 10 diverse country-history prompts; at each token position t we compute log Z_t(1) = log Σ_v ∏_i p_i(v | prompt_i, y_{1:t−1}). Markers are colored by part-of-speech class; the closed-class function tokens (to reaches log Z_t → 0, perfect 10-way consensus; the, in) sit above the open-class…
Figure 4
Figure 4: (a) Absolute pooling gap J(p̄) − Φ(λ) separates the two regimes cleanly with no overlap. (b) The scale-free ratio PB(x) = (J(p̄) − Φ)/Φ overlaps considerably between regimes and is not a useful separator. (c) Internal energy E(λ) = Σ_i λ_i H(p*_λ, p_i). Both the absolute gap and E(λ) separate the regimes; PB does not. All experiments are fully specified by the model checkpoint, prompt sets (including the pe…
Figure 5
Figure 5: Certified top-K approximation. (a)–(b) Mean head fraction f_α(K) for the two subset constructions (score_topk converges faster than union_topk). Bands are ± one standard deviation across 60 neighborhoods. (c)–(d) Boxplots of K_{0.95} and K_{0.99} split by regime (score_topk). Diverse neighborhoods require deeper tail engagement. Z_S(α) := Σ_{v∈S} ∏_i p_i^{α_i}(v) ≤ Z(α). Let ᾱ = Σ_i α_i and apply generalized Hölder wi…
Figure 6
Figure 6: (a) Information collapse: average single-view log S_2(p_i) vs. pooled log S_2(p⋆_λ). Correlated neighborhoods drop below the diagonal (concentration); diverse neighborhoods sit above (anti-concentration). (b) Distribution of the integer-multiplicity gap Δ_{1,2} across regimes, positive in both because the W · log S_q(p_i) term dominates. (c) Tail-engagement scatter: pooled log S_2(p⋆_λ) vs. K^score_{0.95} across …
Figure 7
Figure 7: Model-scale ablation. (a) Per-neighborhood pooling-benefit ratio PB on GPT-2 (124M) vs. Pythia-160M: tightly correlated, with mild model-dependent rescaling. (b) Per-neighborhood pooled log S_2(p⋆_λ) across models, again tightly correlated. The diagnostics are properties of the prior structure, not the checkpoint. by changing one identifier; we report Pythia-160M because it is the largest model that fits…
Figure 8
Figure 8: Per-regime pooling quantities on the 6-row gpt2 benchmark, grouped by W ∈ {3, 10, 100} and split by regime (correlated paraphrases versus diverse country-prompts). (a) The coincidence divergence Φ(λ); (b) the scale-free pooling-benefit ratio PB = (J(p̄) − Φ)/Φ. (0.125) by more than 2×, so PB is not a clean single-feature separator at this benchmark. PB overlaps considerably between regimes and is not a use…
Figure 9
Figure 9: Coincidence phase-transition heatmap (gpt2, W = 12 diverse country-history priors, multiplicity α = (3, . . . , 3), top-100-token stop-word mask). Color encodes Pr[m-coincidence]; the dashed red curve is the empirical threshold m ≈ Ψ(α) log n with fitted Ψ ≈ 1.48. With W = 12, α = 3 the transition sweeps through the center of the displayed grid: no m-coincidence below and to the left, saturated above an…
Figure 10
Figure 10: Per-POS mean log Z_t(1) on the realized W = 10 Paris_Div_10 continuation (T = 40), sorted by consensus (0 = full 10-way agreement). The five function categories (PART, ADP, PUNCT, CCONJ, PRON) occupy the high-consensus top; the content categories the low-consensus bottom. spaCy’s AUX tag here picks up modal/auxiliary verbs whose continuation is content-loaded (“has been”, “have become”), so it sits with t…
Figure 11
Figure 11: Closed-vs-open ROC-AUC of the two consensus features, averaged over W ∈ {3, 8} (the two values of W agree to within 0.005 AUC at every scale). The full multiplicative consensus log Z_t(1) is a single-feature open/closed-class sensor at every model scale (AUC 0.70–0.81, above the threshold), with closed-class positions carrying the higher consensus; the coincidence divergence Φ_t is the weaker feature (AUC 0…
Figure 12
Figure 12: Tightness ratio B^full_n / ε̂_n versus block length n on each of the three datasets, with BCa 95% intervals from B = 200 estimation+sampling bootstrap resamples. The reference lines y = 1 (perfect tightness) and y = M (the union-bound-by-classes ceiling) are shown for each dataset. The ratio stays well below M on all three datasets and across all six block lengths, confirming the bound is within a small-poly…
Figure 13
Figure 13: Three sequence-to-function models as independent priors, read as a genome browser at base-pair scale. Each locus block shows, top to bottom, the three models’ predicted tracks (AlphaGenome, Enformer, Borzoi) and their multi-prior coincidence track ∏_m π_m^c(b) over the central 16 kb (128 bins of 128 bp), at the matched cell type (left) and a mismatched one (right). The light band marks the active element…
Figure 14
Figure 14: The multi-prior benefit at base-pair resolution, one panel per locus. In each panel the single predictor’s track at the matched cell type (light) is broad and multi-peaked (effective support 35/26/41 bins), while the multi-prior coincidence of the three predictors (dark) concentrates to the active element (7/4/17 bins). The coincidence of independent priors is what converts diffuse single-predictor signal…
Figure 15
Figure 15: Coincidence-bonus C_α / min_w D(ρ‖π_w) across the four triple regimes. The within-task triples show smaller bonus magnitudes than the across-task triples; the regime-separation hypothesis is contradicted at the 95% level. The closed-form identity (61) is verified to machine precision on every triple. Stage-1 method. Simulate N = 4 cCRE-like sequence classes (background, PLS, pELS, dELS surrogates) via Dirichl…
Figure 16
Figure 16: Stage-1 boundary-detection ROC on the synthetic windowed-genome. The joint-alphabet AUC of 0.968 [0.959, 0.976] meets the pre-registered ≥ 0.85 target, but the sequence-only AUC saturates at 1.000 rather than staying below the pre-registered 0.65 uninformative floor; the synthetic-data overshoot motivates the stage-2 run on real hg38 to measure the true sequence-only floor. + ENCODE cCRE V4 + HepG2 DNase …
Original abstract

We prove a single algebraic mixed coincidence identity that unifies a broad swath of information-theoretic variational results. For any family of priors $\{\pi_i\}$ and real exponents $\{ \alpha_i \}$, the log of the mixed count $E_{x\sim\nu}\!\left[\prod_{i=1}^W \pi_i^{\alpha_i}(x)\right]$ is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. The identity yields a unified derivation of classical cornerstones of information theory: concentration of empirical distributions (Sanov-type decompositions and Gibbs conditioning), hypothesis-testing error exponents (Chernoff information and its multi-way analogue), change-of-measure inequalities (Donsker-Varadhan and PAC-Bayes), and laws governing rare-pattern coincidences (Erdos-Renyi run-length, iterative guesswork, rate-distortion, and birthday thresholds). Each is recovered as a specialization of the same algebraic equality. It strictly generalizes the classical Renyi entropy and divergence variational formulas (one and two priors respectively) to a $W$-prior simplex, and holds for unnormalized and continuum-indexed priors. Among its consequences are an exact multi-prior PAC-Bayes penalty that subtracts an explicit "coincidence bonus" from the usual single-prior posterior penalty, and the asymptotic MAP error exponent for $W$-ary hypothesis testing as an edge-restricted simplex optimum. We demonstrate the calculus at scale on two large alphabets encoding richly modeled sequential languages: on language-model next-token predictives where we recover contrastive decoding, and on human genomic regulatory sequence where it separates correlated from diverse prior families along a sliding-window trace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to prove a single algebraic mixed coincidence identity: for any family of priors {π_i} and real exponents {α_i}, log E_{x∼ν}[∏_{i=1}^W π_i^{α_i}(x)] simultaneously equals a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. This identity is asserted to yield unified derivations of Sanov-type results, Chernoff information, Donsker-Varadhan and PAC-Bayes inequalities, Renyi generalizations to W priors, and several rare-pattern laws, while holding for unnormalized and continuum-indexed priors. Applications include an exact multi-prior PAC-Bayes penalty and demonstrations on language-model next-token prediction and genomic sequences.

Significance. If the central algebraic identity is rigorously established and holds over the claimed domain (arbitrary real α_i, unnormalized priors), the work would provide a notable unifying algebraic framework for many classical variational results in information theory. The explicit generalization of Renyi formulas to a W-prior simplex and the derived multi-prior PAC-Bayes form with a coincidence bonus would be useful contributions; the large-alphabet empirical examples add concrete illustration.

major comments (2)
  1. [abstract / main identity] Abstract and statement of the main identity: the manuscript asserts that the algebraic identity has been proved and that every listed result follows directly from it, yet supplies no expansion steps, lemmas, or explicit algebraic manipulations showing how log E[∏ π_i^{α_i}(x)] equals the KL-barycenter optimum or the exponential-family normalizer. This absence is load-bearing for the unification claim.
  2. [abstract] Abstract: the identity is claimed to hold for arbitrary real (including negative) α_i and for unnormalized or continuum-indexed priors, but no verification, domain restrictions, or handling of cases where ∏ π_i^{α_i}(x) becomes undefined or infinite (e.g., when some α_i < 0 and some π_i(x) = 0) is provided. Such cases directly affect whether the claimed interchange with Donsker-Varadhan or max-ent forms remains valid for the Sanov and Chernoff specializations.
minor comments (1)
  1. [abstract] Notation for the mixed count uses an implicit measure ν without explicit definition in the abstract; a brief clarification of the reference measure would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for identifying points where the presentation of the central identity can be strengthened. We agree that additional explicit steps and domain clarifications will improve accessibility and rigor. Below we respond point-by-point to the major comments and commit to revisions that address them directly.

read point-by-point responses
  1. Referee: [abstract / main identity] Abstract and statement of the main identity: the manuscript asserts that the algebraic identity has been proved and that every listed result follows directly from it, yet supplies no expansion steps, lemmas, or explicit algebraic manipulations showing how log E[∏ π_i^{α_i}(x)] equals the KL-barycenter optimum or the exponential-family normalizer. This absence is load-bearing for the unification claim.

    Authors: We accept that the current compact statement of the identity, while algebraically direct from the definition of the mixed expectation, does not include intermediate lemmas or step-by-step expansions. The equivalences to the KL-barycenter optimum and exponential-family normalizer follow from standard convex duality and the definition of the log-moment generating function, but these connections were left implicit. In revision we will insert a new subsection (after the statement of the identity) containing two short lemmas: one deriving the KL-barycenter representation via the variational definition of KL divergence, and one recovering the exponential-family normalizer via the cumulant function. These lemmas will contain the explicit algebraic manipulations requested, making the unification claim self-contained. revision: yes

  2. Referee: [abstract] Abstract: the identity is claimed to hold for arbitrary real (including negative) α_i and for unnormalized or continuum-indexed priors, but no verification, domain restrictions, or handling of cases where ∏ π_i^{α_i}(x) becomes undefined or infinite (e.g., when some α_i < 0 and some π_i(x) = 0) is provided. Such cases directly affect whether the claimed interchange with Donsker-Varadhan or max-ent forms remains valid for the Sanov and Chernoff specializations.

    Authors: The manuscript asserts the identity for unnormalized and continuum-indexed priors, yet we agree that the domain statement is insufficiently precise. Negative exponents require that the support of each π_i be respected to keep the product finite and positive. In the revision we will add an explicit domain paragraph stating that the identity holds when either (i) all α_i ≥ 0, or (ii) α_i < 0 only for those i where π_i(x) > 0 almost everywhere under ν, with the convention 0^0 := 1 for measure-zero sets. We will also verify that the Sanov and Chernoff specializations remain valid under these restrictions because the classical statements already impose the necessary support conditions on the empirical measures. A short remark will note that the Donsker-Varadhan and max-ent interchanges continue to hold on this restricted domain. revision: yes
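The domain condition proposed in this response can be expressed as a simple check on a finite alphabet, where "almost everywhere under ν" reduces to "at every x with ν(x) > 0". The sketch below is one reading of conditions (i) and (ii); the function name and example vectors are hypothetical and do not come from the manuscript.

```python
def mixed_count_is_well_defined(nu, priors, alphas):
    """Return True if prod_i pi_i(x)**alpha_i is finite at every x with nu(x) > 0.

    Finite-alphabet reading of the proposed domain:
      (i)  nonnegative exponents are always allowed;
      (ii) a negative exponent alpha_i requires pi_i(x) > 0 wherever nu(x) > 0.
    """
    for x, nu_x in enumerate(nu):
        if nu_x == 0.0:
            continue  # points outside the support of nu never contribute
        for p, a in zip(priors, alphas):
            if a < 0.0 and p[x] == 0.0:
                return False  # would require dividing by zero prior mass
    return True

# Hypothetical example: the first prior vanishes on the last symbol.
priors = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
uniform = [1 / 3, 1 / 3, 1 / 3]
restricted = [0.5, 0.5, 0.0]  # reference measure that avoids the zero

ok_positive_exponents = mixed_count_is_well_defined(uniform, priors, [2.0, 1.0])
ok_negative_on_positive_prior = mixed_count_is_well_defined(uniform, priors, [1.0, -1.0])
bad_negative_on_zero = mixed_count_is_well_defined(uniform, priors, [-1.0, 1.0])
ok_after_restricting_nu = mixed_count_is_well_defined(restricted, priors, [-1.0, 1.0])
```

The last case illustrates why the condition is stated relative to ν: the same prior and exponent become admissible once the reference measure puts no mass on the point where the prior vanishes.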

Circularity Check

0 steps flagged

No significant circularity; central claim is a direct algebraic identity

full rationale

The paper states it proves an algebraic mixed coincidence identity by direct means, with the log-expectation of the product of powered priors serving as the unifying object that specializes to known variational results. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the derivation is presented as rearrangement and specialization of the identity itself. The paper is self-contained against external benchmarks for the algebraic step, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract alone; the paper rests on the truth of the stated algebraic identity and on the ability to specialize it to the listed classical results. No free parameters, invented entities, or additional axioms are mentioned.

axioms (1)
  • [ad hoc to paper] The mixed coincidence identity holds for arbitrary real exponents and for unnormalized or continuum-indexed priors
    This equality is the load-bearing statement whose proof is claimed in the abstract.

pith-pipeline@v0.9.1-grok · 5845 in / 1426 out tokens · 37804 ms · 2026-06-25T21:57:04.128797+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. All you need is log

    cs.IT 2026-06 unverdicted novelty 8.0

    The unique family of multi-distribution Rényi functionals is the positive integral of coincidence divergences C_α over the simplex interior, mixed-sign cones, tropical boundary, and KL edges.

Reference graph

Works this paper leans on

105 extracted references · 10 canonical work pages · cited by 1 Pith paper

  1. [1]

    Brownian excursions, critical random graphs and the multiplicative coalescent

    David Aldous. Brownian excursions, critical random graphs and the multiplicative coalescent. The Annals of Proba- bility, pages 812–854, 1997

  2. [2]

    A variational characterization of rényi divergences

    Venkat Anantharam. A variational characterization of rényi divergences. IEEE Transactions on Information Theory, 64(11):6979–6989, 2018

  3. [3]

    An inequality on guessing and its application to sequential decoding

    Erdal Arikan. An inequality on guessing and its application to sequential decoding. IEEE Transactions on Information Theory, 42(1):99–105, 2002

  4. [4]

    The multiplicative weights update method: a meta-algorithm and appli- cations

    Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and appli- cations. Theory of computing, 8(1):121–164, 2012

  5. [5]

    Projection theorems for the rényi divergence on α-convex sets

    M Ashok Kumar and Igal Sason. Projection theorems for the rényi divergence on α-convex sets. IEEE Transactions on Information Theory, 62(9):4924–4935, 2016

  6. [6]

    Cramér–rao lower bounds arising from generalized csiszár divergences

    M Ashok Kumar and Kumar Vijay Mishra. Cramér–rao lower bounds arising from generalized csiszár divergences. Information Geometry, 3(1):33–59, 2020

  7. [7]

    Robust bounds on risk-sensitive functionals via rényi divergence

    Rami Atar, Kenny Chowdhary, and Paul Dupuis. Robust bounds on risk-sensitive functionals via rényi divergence. SIAM/ASA Journal on Uncertainty Quantification, 3(1):18–33, 2015

  8. [8]

    A Modification of the Sequential Probability Ratio Test to Reduce the Sample Size

    Raghu Raj Bahadur and R. Ranga Rao. On deviations of the sample mean. The Annals of Mathematical Statistics , 31 (4):1015–1027, 1960. doi: 10.1214/aoms/1177705694

  9. [9]

    Adaptive sampling for efficient softmax approximation

    Tavor Z Baharav, Daniel Kang, Colin Sullivan, Mo Tiwari, Eric Luxenberg, David Tse, and Mert Pilanci. Adaptive sampling for efficient softmax approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Sharp finite-sample concentration of independent variables

    Akshay Balsubramani. Sharp finite-sample concentration of independent variables. arXiv preprint arXiv:2008.13293, 2020

  11. [11]

    Entropy, concentration, and learning: a statistical mechanics primer

    Akshay Balsubramani. Entropy, concentration, and learning: a statistical mechanics primer. arXiv preprint arXiv:2409.18630, 2024

  12. [12]

    Introduction to smooth ergodic theory , volume 231

    Luis Barreira and Yakov Pesin. Introduction to smooth ergodic theory , volume 231. American Mathematical Society, 2023

  13. [13]

    Upper and lower bounds on the renyi dimensions and the uniformity of multifractals

    Christian Beck. Upper and lower bounds on the renyi dimensions and the uniformity of multifractals. Physica D: Nonlinear Phenomena, 41(1):67–78, 1990

  14. [14]

    On a measure of divergence between two statistical populations defined by their probability distribution

    Anil Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distribution. Bulletin of the Calcutta Mathematical Society , 35:99–110, 1943

  15. [15]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Moham- mad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML), 2023

  16. [16]

    Variational representations and neural network estimation of rényi divergences

    Jeremiah Birrell, Paul Dupuis, Markos A Katsoulakis, Luc Rey-Bellet, and Jie Wang. Variational representations and neural network estimation of rényi divergences. SIAM Journal on Mathematics of Data Science, 3(4):1093–1116, 2021

  17. [17]

    Variational inference: A review for statisticians

    David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017

  18. [18]

    Concentration Inequalities: A Nonasymptotic Theory of Inde- pendence

    Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Inde- pendence. Oxford University Press, 2013

  19. [19]

    On rényi entropies and their applications to guessing attacks in cryptography

    Serdar Boztas. On rényi entropies and their applications to guessing attacks in cryptography. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences , 97(12):2542–2548, 2014. 28

  20. [20]

    Conditional Rényi divergence saddlepoint and the maximization of α-mutual informa- tion

    Cong Cai and Sergio Verdú. Conditional Rényi divergence saddlepoint and the maximization of α-mutual informa- tion. Entropy, 21(10):969, 2019. doi: 10.3390/e21100969

  21. [21]

    A coding theorem and rényi’s entropy

    L Lorne Campbell. A coding theorem and rényi’s entropy. Information and control, 8(4):423–429, 1965

  22. [22]

    Definition of entropy by means of a coding problem

    LL Campbell. Definition of entropy by means of a coding problem. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 6(2):113–118, 1966

  23. [23]

    Eugenio Clerico, Tyler Farghly, George Deligiannidis, Benjamin Guedj, and Arnaud Doucet

    Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning , volume 56 of Institute of Mathematical Statistics Lecture Notes–Monograph Series . Institute of Mathematical Statistics, Beachwood, OH, 2007. doi: 10.1214/074921707000000391

  24. [24]

    Micro-canonical cascades and random homeomorphisms

    Xinxin Chen, Yong Han, Yanqi Qiu, and Zipeng Wang. Micro-canonical cascades and random homeomorphisms. arXiv preprint arXiv:2505.16405, 2025

  25. [25]

    A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, pages 493–507, 1952

    Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, pages 493–507, 1952

  26. [26]

    Multifractal formalism derived from thermodynamics for general dynamical systems.Electronic Research Announcements, 17:1–11, 2010

    Vaughn Climenhaga. Multifractal formalism derived from thermodynamics for general dynamical systems.Electronic Research Announcements, 17:1–11, 2010

  27. [27]

    Universal randomized guessing subject to distortion

    Asaf Cohen and Neri Merhav. Universal randomized guessing subject to distortion. IEEE Transactions on Information Theory, 68(12):7714–7734, 2022

  28. [28]

    I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3 (1):146–158, 1975

  29. [29]

    I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Annals of Probability , 12(3): 768–793, 1984

  30. [30]

    Csiszár and F

    I. Csiszár and F. Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, June 2003. ISSN 0018-9448

  31. [31]

    Generalized cutoff rates and rényi’s information measures

    Imre Csiszár. Generalized cutoff rates and rényi’s information measures. IEEE Transactions on information theory , 41(1):26–34, 2002

  32. [32]

    Information theory and statistics: A tutorial

    Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Commu- nications and Information Theory, 1(4):417–528, 2004

  33. [33]

    H. E. Daniels. Saddlepoint approximations in statistics. The Annals of Mathematical Statistics , 25(4):631–650, 1954. doi: 10.1214/aoms/1177728652

  34. [34]

    Large deviations techniques and applications

    Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications. Stochastic Modelling and Applied Probability, 2010

  35. [35]

    Large deviations for a general class of random vectors

    Richard S Ellis. Large deviations for a general class of random vectors. The Annals of Probability, 12(1):1–12, 1984

  36. [36]

    On a new law of large numbers

    Paul Erdös and Alfred Rényi. On a new law of large numbers. Journal d’Analyse Mathématique, 23(1):103–111, 1970

  37. [37]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024

  38. [38]

    Adaptive game playing using multiplicative weights

    Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999

  39. [39]

    Laurent, Anqi Shao, Maria Del Mar Alvarez-T orres, Tianji Yu, Jimin Tan, Jiayu Su, Romella Sagatelian, Adolfo A

    Xi Fu, Shentong Mo, Alejandro Buendia, Anouchka P. Laurent, Anqi Shao, Maria Del Mar Alvarez-T orres, Tianji Yu, Jimin Tan, Jiayu Su, Romella Sagatelian, Adolfo A. Ferrando, Alberto Ciccia, Yanyan Lan, David M. Owens, T eresa Palomero, Eric P. Xing, and Raul Rabadan. A foundation model of transcription across human cell types. Nature, 637(8047):965–973, 2...

  40. [40]

    On large deviations from the invariant measure

    Jürgen Gärtner. On large deviations from the invariant measure. Theory of Probability & Its Applications, 22(1):24–39, 1977

  41. [41]

    A characterization theorem for externally bayesian groups

    Christian Genest. A characterization theorem for externally bayesian groups. The Annals of Statistics , 12(3):1100– 1105, 1984

  42. [42]

    Combining probability distributions: A critique and an annotated bibliography

    Christian Genest and James V Zidek. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1(1):114–135, 1986

  43. [43]

    continuity

    Robert B Griffiths and David Ruelle. Strict convexity (“continuity”) of the pressure in lattice systems. Communications in Mathematical Physics, 23(3):169–175, 1971

  44. [44]

    The minimum description length principle

    Peter D Grünwald. The minimum description length principle. MIT press, 2007

  45. [45]

    Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory

    Peter D Grünwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, 2004

  46. [46]

    Regularized rényi divergence minimization through breg- man proximal gradient algorithms

    Thomas Guilmeau, Emilie Chouzenoux, and Víctor Elvira. Regularized rényi divergence minimization through breg- man proximal gradient algorithms. Journal of Machine Learning Research, 26(157):1–56, 2025

  47. [47]

    Training products of experts by minimizing contrastive divergence

    Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8): 1771–1800, 2002

  48. [48]

    spaCy: Industrial-strength natural language processing in python, 2020

    Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in python, 2020. https://spacy.io

  49. [49]

    Justification of logarithmic loss via the benefit of side information

    Jiantao Jiao, Thomas A Courtade, Kartik Venkat, and Tsachy Weissman. Justification of logarithmic loss via the benefit of side information. IEEE Transactions on Information Theory, 61(10):5357–5365, 2015

  50. [50]

    Axiomatic characterization of the directed divergences and their linear combinations

    R Johnson. Axiomatic characterization of the directed divergences and their linear combinations. IEEE Transactions on Information Theory, 25(6):709–716, 1979

  51. [51]

    Positive martingales and random measures

    Jean-Pierre Kahane. Positive martingales and random measures. Chinese Annals of Mathematics Series B, 8(1):1–12, 1987

  52. [52]

    Auto-encoding variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013

  53. [53]

    Fusion of probability density functions

    Günther Koliander, Yousef El-Laham, Petar M Djurić, and Franz Hlawatsch. Fusion of probability density functions. Proceedings of the IEEE, 110(4):404–453, 2022

  54. [54]

    An axiomatic theory of fairness in network resource allocation

    Tian Lan, David Kao, Mung Chiang, and Ashutosh Sabharwal. An axiomatic theory of fairness in network resource allocation. In Proceedings of the 29th conference on Information communications, pages 1343–1351, 2010

  55. [55]

    Mixture of experts meets prompt-based continual learning

    Minh Le, An Nguyen, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Van Ngo, and Nhat Ho. Mixture of experts meets prompt-based continual learning. Advances in Neural Information Processing Systems, 37:119025–119062, 2024

  56. [56]

    Contrastive decoding: Open-ended text generation as optimization

    Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Mike Lewis, and Luke Zettlemoyer. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023. arXiv:2210.15097

  57. [57]

    On divergences and informations in statistics and information theory

    Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006

  58. [58]

    Divergence measures based on the Shannon entropy

    Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991

  59. [59]

    Saddle point approximation for the distribution of the sum of independent random variables

    Robert Lugannani and Stephen Rice. Saddle point approximation for the distribution of the sum of independent random variables. Advances in Applied Probability, 12(2):475–490, 1980. doi: 10.2307/1426607

  60. [60]

    Algorithmic fractal dimensions in geometric measure theory

    Jack H Lutz and Elvira Mayordomo. Algorithmic fractal dimensions in geometric measure theory. In Handbook of Computability and Complexity in Analysis, pages 271–302. Springer, 2021

  61. [61]

    Rewriting regulatory DNA...

    Gabriella E. Martyn, Michael T. Montgomery, Hank Jones, Katherine Guo, Benjamin R. Doughty, Johannes Linder, Deepa Bisht, Fan Xia, Xiangmeng S. Cai, Ziwei Chen, Kelly Cochran, Kathryn A. Lawrence, Glen Munson, Anusri Pampari, Charles P. Fulco, Nidhi Sahni, David R. Kelley, Eric S. Lander, Anshul Kundaje, and Jesse M. Engreitz. Rewriting regulatory DNA...

  62. [62]

    On the notion of affinity of several distributions and some of its applications

    Kameo Matusita. On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, 19(1):181–192, 1967

  63. [63]

    Some PAC-Bayesian theorems

    David A McAllester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on Computational learning theory, pages 230–234, 1998

  64. [64]

    Generalized q-dimensions of measures on non-autonomous conformal sets

    Jun Jie Miao and Tianrui Wang. Generalized q-dimensions of measures on non-autonomous conformal sets. arXiv preprint arXiv:2512.19771, 2025

  65. [65]

    Fair end-to-end window-based congestion control

    Jeonghoon Mo and Jean Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on Networking, 8(5):556–567, 2002

  66. [66]

    From Blackwell dominance in large samples to Rényi divergences and back again

    Xiaosheng Mu, Luciano Pomatto, Philipp Strack, and Omer Tamuz. From Blackwell dominance in large samples to Rényi divergences and back again. Econometrica, 89(1):475–506, 2021

  67. [67]

    An information-geometric characterization of Chernoff information

    Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013

  68. [68]

    Kernel language entropy: Fine-grained uncertainty quantification for LLMs

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  69. [69]

    Perspective on physical interpretations of Rényi entropy in statistical mechanics

    Misaki Ozawa and Nina Javerzat. Perspective on physical interpretations of Rényi entropy in statistical mechanics. Europhysics Letters, 147(1):11001, 2024

  70. [70]

    Characteristic Lyapunov exponents and smooth ergodic theory

    Ya B Pesin. Characteristic Lyapunov exponents and smooth ergodic theory. Russian Mathematical Surveys, 32(4):55, 1977

  71. [71]

    Repulsive mixtures

    Francesca Petralia, Vinayak Rao, and David Dunson. Repulsive mixtures. Advances in Neural Information Processing Systems, 25, 2012

  72. [72]

    Information theory: From coding to learning

    Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning. Cambridge University Press, 2025

  73. [73]

    Poincaré on Gibbs and on probability in statistical mechanics

    Bruce D Popp. Poincaré on Gibbs and on probability in statistical mechanics. arXiv preprint arXiv:2505.12168, 2025

  74. [74]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI technical report, 2019

  75. [75]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  76. [76]

    On the dimension and entropy of probability distributions

    Alfréd Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1):193–215, 1959

  77. [77]

    Dimension, entropy and information

    Alfréd Rényi. Dimension, entropy and information. In Trans. 2nd Prague Conf. Information Theory, pages 545–556, 1960

  78. [78]

    On measures of entropy and information

    Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, volume 4, pages 547–562. University of California Press, 1961

  79. [79]

    On the foundations of information theory

    Alfréd Rényi. On the foundations of information theory. Revue de l’Institut International de Statistique, pages 1–14, 1965

  80. [80]

    On the probability of large deviations of random magnitudes

    Ivan Nikolaevich Sanov. On the probability of large deviations of random magnitudes. Matematicheskii Sbornik, 84(1):11–44, 1957. Translation at https://repository.lib.ncsu.edu/items/8f909775-ba1b-4874-acc2-362a8221edb0

Showing first 80 references.