Information from coincidences

Akshay Balsubramani

arXiv: 2606.25042 · v1 · pith:N5637P2Snew · submitted 2026-06-23 · cs.IT · math.IT · math.PR · math.ST · stat.ML · stat.TH


Pith reviewed 2026-06-25 21:57 UTC · model grok-4.3

classification cs.IT math.IT math.PR math.ST stat.ML stat.TH

keywords information theory, variational methods, coincidence identity, Renyi entropy, PAC-Bayes, Chernoff information, Sanov theorem, hypothesis testing


The pith

One algebraic identity shows the log of a mixed prior count equals a Boltzmann weight, normalizer, maximum-entropy value, and KL-barycenter optimum at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a single algebraic mixed coincidence identity for any family of priors and real exponents. The log of the expectation of their powered product is shown to equal four different classical objects used in variational information theory. If correct, this identity supplies a common derivation for results on empirical distribution concentration, hypothesis testing exponents, change-of-measure inequalities, and rare coincidence thresholds, while extending the usual one- and two-prior Renyi formulas to any number of priors and to unnormalized or continuum cases.
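To make the one- and two-prior Renyi cases concrete, here is a minimal Python sketch on a finite alphabet. It assumes ν is counting measure (all weights 1) and compares the log mixed count against the textbook definitions H_α(p) = (1/(1−α)) log Σ_x p(x)^α and D_α(p‖q) = (1/(α−1)) log Σ_x p(x)^α q(x)^{1−α}. The function names and example distributions are illustrative and do not come from the paper.

```python
import math

def log_mixed_count(priors, alphas, nu):
    """log sum_x nu[x] * prod_i priors[i][x] ** alphas[i] on a finite alphabet."""
    total = 0.0
    for x, weight in enumerate(nu):
        term = weight
        for pi, a in zip(priors, alphas):
            term *= pi[x] ** a
        total += term
    return math.log(total)

def renyi_entropy(p, a):
    """Textbook Renyi entropy of order a (a != 1)."""
    return math.log(sum(px ** a for px in p)) / (1.0 - a)

def renyi_divergence(p, q, a):
    """Textbook Renyi divergence of order a (a != 1), strictly positive q."""
    return math.log(sum(px ** a * qx ** (1.0 - a) for px, qx in zip(p, q))) / (a - 1.0)

# Illustrative inputs (assumptions, not taken from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
counting = [1.0, 1.0, 1.0]   # assumed reference measure: counting measure
a = 2.0

# One prior, exponent a: rescaled log mixed count is the Renyi entropy.
h = log_mixed_count([p], [a], counting) / (1.0 - a)
# Two priors, exponents (a, 1 - a): rescaled log mixed count is the Renyi divergence.
d = log_mixed_count([p, q], [a, 1.0 - a], counting) / (a - 1.0)

print(h, renyi_entropy(p, a))
print(d, renyi_divergence(p, q, a))
```

Note that the two-prior case with a = 2 already uses a negative exponent on q, which is harmless here only because q is strictly positive.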

Core claim

The central claim is that log E_{x∼ν}[∏_{i=1}^W π_i^{α_i}(x)] is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. Specializing the same equality recovers Sanov-type decompositions, Chernoff information, Donsker-Varadhan and PAC-Bayes inequalities, Erdos-Renyi run lengths, rate-distortion thresholds, and birthday problems. The identity generalizes the classical Renyi variational formulas to a W-prior simplex and holds for unnormalized and continuum-indexed priors.
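One way to sanity-check a claim of this shape numerically is sketched below for a finite alphabet with strictly positive priors. It uses the standard Gibbs (Donsker-Varadhan) variational principle, log Σ_x ν(x) e^{f(x)} = max_ρ {E_ρ[f] − KL(ρ‖ν)} with f = Σ_i α_i log π_i, as a stand-in for the paper's own statement, which is not reproduced on this page. The maximizer is the tilted (Boltzmann) distribution ρ*(x) ∝ ν(x) ∏_i π_i(x)^{α_i}, whose normalizer is the mixed count. All names and numbers are illustrative assumptions.

```python
import math

def tilted(priors, alphas, nu):
    """Return (log Z, rho_star), where rho_star(x) is proportional to
    nu(x) * prod_i priors[i][x] ** alphas[i] and Z is the normalizer."""
    weights = []
    for x, w in enumerate(nu):
        for pi, a in zip(priors, alphas):
            w *= pi[x] ** a
        weights.append(w)
    z = sum(weights)
    return math.log(z), [w / z for w in weights]

def gibbs_objective(rho, priors, alphas, nu):
    """sum_i alpha_i * E_rho[log pi_i] - KL(rho || nu), skipping zero-mass points."""
    value = 0.0
    for x, r in enumerate(rho):
        if r > 0.0:
            energy = sum(a * math.log(pi[x]) for pi, a in zip(priors, alphas))
            value += r * (energy - math.log(r / nu[x]))
    return value

# Illustrative inputs: strictly positive priors, one negative exponent.
priors = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.3, 0.2]]
alphas = [2.0, 0.5, -1.0]
nu = [0.25, 0.25, 0.25, 0.25]

log_z, rho_star = tilted(priors, alphas, nu)
at_optimum = gibbs_objective(rho_star, priors, alphas, nu)   # should match log_z
at_uniform = gibbs_objective(nu, priors, alphas, nu)         # should fall below log_z
print(log_z, at_optimum, at_uniform)
```

The gap between `log_z` and the objective at any other ρ equals KL(ρ‖ρ*), which is why the tilted distribution is the unique maximizer in this finite setting.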

What carries the argument

The mixed count E_{x∼ν}[∏_{i=1}^W π_i^{α_i}(x)], whose logarithm is equated to the four variational objects listed above.

If this is right

Sanov decompositions and Gibbs conditioning follow as direct special cases.
Chernoff information and its multi-way version give hypothesis-testing error exponents.
Donsker-Varadhan and PAC-Bayes change-of-measure inequalities are recovered uniformly.
An exact multi-prior PAC-Bayes penalty subtracts an explicit coincidence bonus from the usual term.
The asymptotic MAP error exponent for W-ary testing appears as an edge-restricted simplex optimum.
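The hypothesis-testing items can be illustrated with the classical two-prior case. The sketch below computes Chernoff information C(p, q) = −min_{λ∈[0,1]} log Σ_x p(x)^λ q(x)^{1−λ}, which is the log mixed count with exponents (λ, 1−λ) under counting measure, and then takes the minimum over pairs. Reading the "edge-restricted simplex optimum" as the classical minimum pairwise Chernoff information is an assumption on my part; the helper names are hypothetical.

```python
import math
from itertools import combinations

def log_affinity(p, q, lam):
    """log sum_x p(x)^lam * q(x)^(1 - lam): the two-prior log mixed count."""
    return math.log(sum(px ** lam * qx ** (1.0 - lam) for px, qx in zip(p, q)))

def chernoff_information(p, q, iters=200):
    """Return (C, lam_star) by ternary search; log_affinity is convex in lam."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_affinity(p, q, m1) < log_affinity(p, q, m2):
            hi = m2
        else:
            lo = m1
    lam = 0.5 * (lo + hi)
    return -log_affinity(p, q, lam), lam

def min_pairwise_chernoff(hypotheses):
    """Smallest Chernoff information over all pairs (edges) of hypotheses."""
    return min(chernoff_information(p, q)[0] for p, q in combinations(hypotheses, 2))

# Illustrative strictly positive distributions (assumptions).
p = [0.8, 0.2]
q = [0.2, 0.8]
c, lam_star = chernoff_information(p, q)
print(c, lam_star)
print(min_pairwise_chernoff([p, q, [0.5, 0.5]]))
```

For this symmetric pair the optimum sits at λ = 1/2 and the affinity is 2·sqrt(0.16) = 0.8, so C = −log 0.8.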

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same identity may supply exact finite-sample versions of several bounds that are usually stated only asymptotically.
Contrastive decoding in language models and sliding-window separation of genomic priors illustrate how the calculus applies at scale to sequential data.
Other variational problems outside information theory that involve products of measures could admit analogous unifications.

Load-bearing premise

The algebraic identity holds exactly for arbitrary real exponents and for unnormalized or continuum-indexed priors.

What would settle it

A concrete counterexample computation, for chosen priors, exponents, and reference measure, in which the log mixed count fails to equal the claimed KL-barycenter optimum.
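A counterexample search of this kind can be sketched for the simplest setting: a finite alphabet, counting reference measure, strictly positive normalized priors, and nonnegative weights λ_i that sum to 1. Under those assumptions the textbook barycenter form reads −log Σ_x ∏_i π_i(x)^{λ_i} = min_ρ Σ_i λ_i KL(ρ‖π_i); whether this matches the paper's exact normalization is an assumption. The harness below checks equality at the normalized geometric mean and looks for any random ρ that would beat it.

```python
import math
import random

def kl(rho, pi):
    """KL(rho || pi) on a finite alphabet, with 0 log 0 treated as 0."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0.0)

def barycenter_objective(rho, priors, lams):
    """sum_i lam_i * KL(rho || pi_i)."""
    return sum(l * kl(rho, pi) for pi, l in zip(priors, lams))

def random_simplex_point(n, rng):
    """Uniform draw from the probability simplex via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(draws)
    return [d / s for d in draws]

def search_for_violation(priors, lams, trials=2000, seed=0):
    """Return (gap at the candidate optimum, smallest slack over random rho).
    A clearly negative slack would be a counterexample in this setting."""
    rng = random.Random(seed)
    n = len(priors[0])
    geo = [math.prod(pi[x] ** l for pi, l in zip(priors, lams)) for x in range(n)]
    z = sum(geo)
    target = -math.log(z)                     # minus the log mixed count
    rho_star = [g / z for g in geo]           # normalized geometric mean
    gap = barycenter_objective(rho_star, priors, lams) - target
    slack = min(
        barycenter_objective(random_simplex_point(n, rng), priors, lams) - target
        for _ in range(trials)
    )
    return gap, slack

# Illustrative inputs (assumptions, not from the paper).
priors = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.3, 0.2]]
lams = [0.5, 0.3, 0.2]
gap, slack = search_for_violation(priors, lams)
print(gap, slack)
```

Random search cannot prove the identity; it only fails to find a violation. Extending the harness to negative exponents or unnormalized priors would probe the part of the domain the reviewers flag as least certain.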

Figures

Figures reproduced from arXiv: 2606.25042 by Akshay Balsubramani.

**Figure 1.** Phase-transition heatmap on three benchmarks at W = 3, α = (2, 2, 2). (a) Synthetic three-prior Zipf bank (|V| = 10^4, rank-1.2, σ = 0.1 logit noise). (b) gpt2 correlated Paris paraphrases (Paris_Corr_3; |V| = 50,257). (c) gpt2 diverse country-history templates (Paris_Div_3; |V| = 50,257). Color encodes empirical Pr[m-coincidence]; the dashed green curve is the theoretical threshold m = Ψ(α) log n. proxy c…

**Figure 2.** Two complementary instruments for the pooled partition function Z(α), the same target in both panels. (a) Certified two-sided bracket (this work) on the realized gpt2 priors (W = 3, α = (2, 2, 2); head-set S the union of per-prior tops at size K). The certified relative width (U − L)/L descends monotonically to machine precision, below 10^{-6} by K ≈ 20 on both Paris_Corr_3 (correlated) and Paris_Div_3 (diverse, …

**Figure 3.** The mixed coincidence partition function recurs along trajectories. (a) Language: gpt2 autoregressive continuation under W = 10 diverse country-history prompts; at each token position t we compute log Z_t(1) = log Σ_v ∏_i p_i(v | prompt_i, y_{1:t−1}). Markers are colored by part-of-speech class; the closed-class function tokens (to reaches log Z_t → 0, perfect 10-way consensus; the, in) sit above the open-class…

**Figure 4.** (a) Absolute pooling gap J(p̄) − Φ(λ) separates the two regimes cleanly with no overlap. (b) The scale-free ratio PB(x) = (J(p̄) − Φ)/Φ overlaps considerably between regimes and is not a useful separator. (c) Internal energy E(λ) = Σ_i λ_i H(p^∗_λ, p_i). Both the absolute gap and E(λ) separate the regimes; PB does not. All experiments are fully specified by the model checkpoint, prompt sets (including the pe…

**Figure 5.** Certified top-K approximation. (a)–(b) Mean head fraction f_α(K) for the two subset constructions (score_topk converges faster than union_topk). Bands are ± one standard deviation across 60 neighborhoods. (c)–(d) Boxplots of K_{0.95} and K_{0.99} split by regime (score_topk). Diverse neighborhoods require deeper tail engagement. Z_S(α) := Σ_{v∈S} ∏_i p_i^{α_i}(v) ≤ Z(α). Let ᾱ = Σ_i α_i and apply generalized Hölder wi…

**Figure 6.** (a) Information collapse: average single-view log S_2(p_i) vs. pooled log S_2(p^⋆_λ). Correlated neighborhoods drop below the diagonal (concentration); diverse neighborhoods sit above (anti-concentration). (b) Distribution of the integer-multiplicity gap Δ_{1,2} across regimes – positive in both because the W · log S_q(p_i) term dominates. (c) Tail-engagement scatter: pooled log S_2(p^⋆_λ) vs. K^{score}_{0.95} across …

**Figure 7.** Model-scale ablation. (a) Per-neighborhood pooling-benefit ratio PB on GPT-2 (124M) vs. Pythia-160M – tightly correlated, with mild model-dependent rescaling. (b) Per-neighborhood pooled log S_2(p^⋆_λ) across models, again tightly correlated. The diagnostics are properties of the prior structure, not the checkpoint. by changing one identifier; we report Pythia-160M because it is the largest model that fits…

**Figure 8.** Per-regime pooling quantities on the 6-row gpt2 benchmark, grouped by W ∈ {3, 10, 100} and split by regime (correlated paraphrases versus diverse country-prompts). (a) The coincidence divergence Φ(λ); (b) the scale-free pooling-benefit ratio PB = (J(p̄) − Φ)/Φ. (0.125) by more than 2×, so PB is not a clean single-feature separator at this benchmark. PB overlaps considerably between regimes and is not a use…

**Figure 9.** Coincidence phase-transition heatmap (gpt2, W = 12 diverse country-history priors, multiplicity α = (3, . . . , 3), top-100-token stop-word mask). Color encodes Pr[m-coincidence]; the dashed red curve is the empirical threshold m ≈ Ψ(α) log n with fitted Ψ ≈ 1.48. With W = 12, α = 3 the transition sweeps through the center of the displayed grid: no m-coincidence below and to the left, saturated above an…

**Figure 10.** Per-POS mean log Z_t(1) on the realized W = 10 Paris_Div_10 continuation (T = 40), sorted by consensus (0 = full 10-way agreement). The five function categories (PART, ADP, PUNCT, CCONJ, PRON) occupy the high-consensus top; the content categories the low-consensus bottom. spaCy’s AUX tag here picks up modal/auxiliary verbs whose continuation is content-loaded (“has been”, “have become”), so it sits with t…

**Figure 11.** Closed-vs-open ROC-AUC of the two consensus features, averaged over W ∈ {3, 8} (the two values of W agree to within 0.005 AUC at every scale). The full multiplicative consensus log Z_t(1) is a single-feature open/closed-class sensor at every model scale (AUC 0.70–0.81, above the threshold), with closed-class positions carrying the higher consensus; the coincidence divergence Φ_t is the weaker feature (AUC 0…

**Figure 12.** Tightness ratio B^{full}_n / ε̂_n versus block length n on each of the three datasets, with BCa 95% intervals from B = 200 estimation+sampling bootstrap resamples. The reference lines y = 1 (perfect tightness) and y = M (the union-bound-by-classes ceiling) are shown for each dataset. The ratio stays well below M on all three datasets and across all six block lengths, confirming the bound is within a small-poly…

**Figure 13.** Three sequence-to-function models as independent priors, read as a genome browser at base-pair scale. Each locus block shows, top to bottom, the three models’ predicted tracks (AlphaGenome, Enformer, Borzoi) and their multi-prior coincidence track ∏_m π_m^c(b) over the central 16 kb (128 bins of 128 bp), at the matched cell type (left) and a mismatched one (right). The light band marks the active element…

**Figure 14.** The multi-prior benefit at base-pair resolution, one panel per locus. In each panel the single predictor’s track at the matched cell type (light) is broad and multi-peaked (effective support 35/26/41 bins), while the multi-prior coincidence of the three predictors (dark) concentrates to the active element (7/4/17 bins). The coincidence of independent priors is what converts diffuse single-predictor signal…

**Figure 15.** Coincidence-bonus C_α / min_w D(ρ‖π_w) across the four triple regimes. The within-task triples show smaller bonus magnitudes than the across-task triples; the regime-separation hypothesis is contradicted at the 95% level. The closed-form identity (61) is verified to machine precision on every triple. Stage-1 method. Simulate N = 4 cCRE-like sequence classes (background, PLS, pELS, dELS surrogates) via Dirichl…

**Figure 16.** Stage-1 boundary-detection ROC on the synthetic windowed-genome. The joint-alphabet AUC of 0.968 [0.959, 0.976] meets the pre-registered ≥ 0.85 target, but the sequence-only AUC saturates at 1.000 rather than staying below the pre-registered 0.65 uninformative floor; the synthetic-data overshoot motivates the stage-2 run on real hg38 to measure the true sequence-only floor. + ENCODE cCRE V4 + HepG2 DNase …

Original abstract

We prove a single algebraic mixed coincidence identity that unifies a broad swath of information-theoretic variational results. For any family of priors $\{\pi_i\}$ and real exponents $\{ \alpha_i \}$, the log of the mixed count $E_{x\sim\nu}\!\left[\prod_{i=1}^W \pi_i^{\alpha_i}(x)\right]$ is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. The identity yields a unified derivation of classical cornerstones of information theory: concentration of empirical distributions (Sanov-type decompositions and Gibbs conditioning), hypothesis-testing error exponents (Chernoff information and its multi-way analogue), change-of-measure inequalities (Donsker-Varadhan and PAC-Bayes), and laws governing rare-pattern coincidences (Erdos-Renyi run-length, iterative guesswork, rate-distortion, and birthday thresholds). Each is recovered as a specialization of the same algebraic equality. It strictly generalizes the classical Renyi entropy and divergence variational formulas (one and two priors respectively) to a $W$-prior simplex, and holds for unnormalized and continuum-indexed priors. Among its consequences are an exact multi-prior PAC-Bayes penalty that subtracts an explicit "coincidence bonus" from the usual single-prior posterior penalty, and the asymptotic MAP error exponent for $W$-ary hypothesis testing as an edge-restricted simplex optimum. We demonstrate the calculus at scale on two large alphabets encoding richly modeled sequential languages: on language-model next-token predictives where we recover contrastive decoding, and on human genomic regulatory sequence where it separates correlated from diverse prior families along a sliding-window trace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core is a claimed algebraic identity unifying Renyi formulas and several classical results under a multi-prior mixed count, but the extension to negative alphas and unnormalized priors looks shaky without extra conditions.


The paper's main contribution is an algebraic identity saying that the log of E[product of powered priors] equals a Boltzmann weight, exponential-family normalizer, max-ent value, and KL-barycenter optimum all at once. This is positioned as a direct generalization of the one- and two-prior Renyi variational formulas to W priors, with Sanov decompositions, Chernoff information, Donsker-Varadhan, PAC-Bayes penalties, and several coincidence laws falling out as special cases.

It does recover the classical one- and two-prior formulas cleanly and shows how the same expression yields an explicit multi-prior PAC-Bayes penalty that subtracts a coincidence bonus. The applications to language-model next-token prediction (recovering contrastive decoding) and sliding-window analysis of genomic sequences are concrete and use real-scale alphabets, which is better than pure theory.
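For the language-model application, the quantities involved are simple to compute from next-token log-probabilities. The sketch below assumes the consensus statistic from the Figure 3 caption, log Z_t(1) = log Σ_v ∏_i p_i(v), and a pooled score Σ_i α_i log p_i(v). Reading contrastive decoding as the two-prior case with exponents (1, −1), so that tokens are ranked by log p_expert(v) − log p_amateur(v), is my interpretation rather than the paper's stated construction; any candidate filtering used in practice is omitted, and the toy distributions are made up.

```python
import math

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def pooled_log_scores(log_probs, alphas):
    """Unnormalized pooled log score per token: sum_i alpha_i * log p_i(v)."""
    vocab = len(log_probs[0])
    return [sum(a * lp[v] for lp, a in zip(log_probs, alphas)) for v in range(vocab)]

def log_consensus(log_probs):
    """log Z(1) = log sum_v prod_i p_i(v). For two or more normalized priors this
    is at most 0, with 0 meaning all priors agree on a single token."""
    return logsumexp(pooled_log_scores(log_probs, [1.0] * len(log_probs)))

# Toy three-token vocabulary (illustrative numbers only).
expert = [0.6, 0.3, 0.1]
amateur = [0.7, 0.1, 0.2]
log_probs = [[math.log(x) for x in expert], [math.log(x) for x in amateur]]

contrast = pooled_log_scores(log_probs, [1.0, -1.0])   # log p_expert - log p_amateur
best = max(range(len(contrast)), key=contrast.__getitem__)
lz = log_consensus(log_probs)
print(best, lz)
```

Token 1 wins the contrastive ranking because it is the only token the expert rates higher than the amateur, even though token 0 has the largest expert probability.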

The soft spot is the stated domain. The identity is asserted for arbitrary real α_i (including negative values) and for unnormalized or continuum-indexed priors. When some α_i is negative, the product ∏ π_i^{α_i}(x) can be zero, infinite, or undefined wherever a prior vanishes, and the claimed interchange with the max-ent or Donsker-Varadhan forms is not automatic. This is where a stress test matters: if the paper only checks positive α_i or normalized priors, the unification does not cover the full range advertised. Edge-case verification and any required regularity conditions need to be stated explicitly.

This is for readers already working on variational characterizations and multi-prior extensions. Someone looking for a single algebraic roof over Sanov-Chernoff-PAC-Bayes territory would find it worth checking, but they will have to verify the algebra and the boundary cases themselves.

Send it to peer review. The unification claim is large enough to justify referee time even if the domain needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims to prove a single algebraic mixed coincidence identity: for any family of priors {π_i} and real exponents {α_i}, log E_{x∼ν}[∏_{i=1}^W π_i^{α_i}(x)] simultaneously equals a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. This identity is asserted to yield unified derivations of Sanov-type results, Chernoff information, Donsker-Varadhan and PAC-Bayes inequalities, Renyi generalizations to W priors, and several rare-pattern laws, while holding for unnormalized and continuum-indexed priors. Applications include an exact multi-prior PAC-Bayes penalty and demonstrations on language-model next-token prediction and genomic sequences.

Significance. If the central algebraic identity is rigorously established and holds over the claimed domain (arbitrary real α_i, unnormalized priors), the work would provide a notable unifying algebraic framework for many classical variational results in information theory. The explicit generalization of Renyi formulas to a W-prior simplex and the derived multi-prior PAC-Bayes form with a coincidence bonus would be useful contributions; the large-alphabet empirical examples add concrete illustration.

major comments (2)

[abstract / main identity] Abstract and statement of the main identity: the manuscript asserts that the algebraic identity has been proved and that every listed result follows directly from it, yet supplies no expansion steps, lemmas, or explicit algebraic manipulations showing how log E[∏ π_i^{α_i}(x)] equals the KL-barycenter optimum or the exponential-family normalizer. This absence is load-bearing for the unification claim.
[abstract] Abstract: the identity is claimed to hold for arbitrary real (including negative) α_i and for unnormalized or continuum-indexed priors, but no verification, domain restrictions, or handling of cases where ∏ π_i^{α_i}(x) becomes undefined or infinite (e.g., when some α_i < 0 and some π_i(x) = 0) is provided. Such cases directly affect whether the claimed interchange with Donsker-Varadhan or max-ent forms remains valid for the Sanov and Chernoff specializations.

minor comments (1)

[abstract] Notation for the mixed count uses an implicit measure ν without explicit definition in the abstract; a brief clarification of the reference measure would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for identifying points where the presentation of the central identity can be strengthened. We agree that additional explicit steps and domain clarifications will improve accessibility and rigor. Below we respond point-by-point to the major comments and commit to revisions that address them directly.


Referee: [abstract / main identity] Abstract and statement of the main identity: the manuscript asserts that the algebraic identity has been proved and that every listed result follows directly from it, yet supplies no expansion steps, lemmas, or explicit algebraic manipulations showing how log E[∏ π_i^{α_i}(x)] equals the KL-barycenter optimum or the exponential-family normalizer. This absence is load-bearing for the unification claim.

Authors: We accept that the current compact statement of the identity, while algebraically direct from the definition of the mixed expectation, does not include intermediate lemmas or step-by-step expansions. The equivalences to the KL-barycenter optimum and exponential-family normalizer follow from standard convex duality and the definition of the log-moment generating function, but these connections were left implicit. In revision we will insert a new subsection (after the statement of the identity) containing two short lemmas: one deriving the KL-barycenter representation via the variational definition of KL divergence, and one recovering the exponential-family normalizer via the cumulant function. These lemmas will contain the explicit algebraic manipulations requested, making the unification claim self-contained.
revision: yes

Referee: [abstract] Abstract: the identity is claimed to hold for arbitrary real (including negative) α_i and for unnormalized or continuum-indexed priors, but no verification, domain restrictions, or handling of cases where ∏ π_i^{α_i}(x) becomes undefined or infinite (e.g., when some α_i < 0 and some π_i(x) = 0) is provided. Such cases directly affect whether the claimed interchange with Donsker-Varadhan or max-ent forms remains valid for the Sanov and Chernoff specializations.

Authors: The manuscript asserts the identity for unnormalized and continuum-indexed priors, and we agree that the domain statement is insufficiently precise. Negative exponents require that the support of each π_i be respected so that the product stays finite and positive. In the revision we will add an explicit domain paragraph stating that the identity holds when either (i) all α_i ≥ 0, or (ii) α_i < 0 only for those i where π_i(x) > 0 almost everywhere under ν, with the convention 0^0 := 1 on measure-zero sets. We will also verify that the Sanov and Chernoff specializations remain valid under these restrictions, because the classical statements already impose the necessary support conditions on the empirical measures. A short remark will note that the Donsker-Varadhan and max-ent interchanges continue to hold on this restricted domain.
revision: yes
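The proposed domain condition is easy to state as a check on a finite alphabet. The sketch below is one possible reading of conditions (i) and (ii): a negative exponent is allowed only on priors that are strictly positive wherever ν has mass. The function names and the choice to return `None` outside the domain are illustrative assumptions, not part of the rebuttal.

```python
import math

def violates_support(priors, alphas, nu):
    """True if some point with nu-mass has a zero prior raised to a negative exponent."""
    return any(
        w > 0.0 and a < 0.0 and pi[x] == 0.0
        for x, w in enumerate(nu)
        for pi, a in zip(priors, alphas)
    )

def guarded_log_mixed_count(priors, alphas, nu):
    """log sum_x nu(x) * prod_i pi_i(x)^alpha_i, or None outside the stated domain."""
    if violates_support(priors, alphas, nu):
        return None
    total = 0.0
    for x, w in enumerate(nu):
        if w == 0.0:
            continue                  # nu-null points never contribute
        term = w
        for pi, a in zip(priors, alphas):
            term *= pi[x] ** a        # Python evaluates 0.0 ** 0.0 as 1.0
        total += term
    return math.log(total) if total > 0.0 else -math.inf

# Illustrative inputs: p vanishes on the third symbol and carries exponent -1.
p = [0.5, 0.5, 0.0]
q = [0.2, 0.3, 0.5]
alphas = [-1.0, 1.0]

outside = guarded_log_mixed_count([p, q], alphas, [1 / 3, 1 / 3, 1 / 3])
inside = guarded_log_mixed_count([p, q], alphas, [0.5, 0.5, 0.0])
print(outside, inside)
```

With ν uniform the zero of p falls on a point with mass, so the check rejects the input; moving ν off that point restores a finite value, here log 0.5.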

Circularity Check

0 steps flagged

No significant circularity; central claim is a direct algebraic identity


The paper states it proves an algebraic mixed coincidence identity by direct means, with the log-expectation of the product of powered priors serving as the unifying object that specializes to known variational results. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the derivation is presented as rearrangement and specialization of the identity itself. The paper is self-contained against external benchmarks for the algebraic step, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract alone; the paper rests on the truth of the stated algebraic identity and on the ability to specialize it to the listed classical results. No free parameters, invented entities, or additional axioms are mentioned.

axioms (1)

[ad hoc to paper] The mixed coincidence identity holds for arbitrary real exponents and for unnormalized or continuum-indexed priors.
This equality is the load-bearing statement whose proof is claimed in the abstract.

pith-pipeline@v0.9.1-grok · 5845 in / 1426 out tokens · 37804 ms · 2026-06-25T21:57:04.128797+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

All you need is log
cs.IT 2026-06 unverdicted novelty 8.0

The unique family of multi-distribution Rényi functionals is the positive integral of coincidence divergences C_α over the simplex interior, mixed-sign cones, tropical boundary, and KL edges.

Reference graph

Works this paper leans on

105 extracted references · 10 canonical work pages · cited by 1 Pith paper

[1]

Brownian excursions, critical random graphs and the multiplicative coalescent

David Aldous. Brownian excursions, critical random graphs and the multiplicative coalescent. The Annals of Proba- bility, pages 812–854, 1997

1997
[2]

A variational characterization of rényi divergences

Venkat Anantharam. A variational characterization of rényi divergences. IEEE Transactions on Information Theory, 64(11):6979–6989, 2018

2018
[3]

An inequality on guessing and its application to sequential decoding

Erdal Arikan. An inequality on guessing and its application to sequential decoding. IEEE Transactions on Information Theory, 42(1):99–105, 2002

2002
[4]

The multiplicative weights update method: a meta-algorithm and appli- cations

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and appli- cations. Theory of computing, 8(1):121–164, 2012

2012
[5]

Projection theorems for the rényi divergence on α-convex sets

M Ashok Kumar and Igal Sason. Projection theorems for the rényi divergence on α-convex sets. IEEE Transactions on Information Theory, 62(9):4924–4935, 2016

2016
[6]

Cramér–rao lower bounds arising from generalized csiszár divergences

M Ashok Kumar and Kumar Vijay Mishra. Cramér–rao lower bounds arising from generalized csiszár divergences. Information Geometry, 3(1):33–59, 2020

2020
[7]

Robust bounds on risk-sensitive functionals via rényi divergence

Rami Atar, Kenny Chowdhary, and Paul Dupuis. Robust bounds on risk-sensitive functionals via rényi divergence. SIAM/ASA Journal on Uncertainty Quantification, 3(1):18–33, 2015

2015
[8]

A Modification of the Sequential Probability Ratio Test to Reduce the Sample Size

Raghu Raj Bahadur and R. Ranga Rao. On deviations of the sample mean. The Annals of Mathematical Statistics , 31 (4):1015–1027, 1960. doi: 10.1214/aoms/1177705694

work page doi:10.1214/aoms/1177705694 1960
[9]

Adaptive sampling for efficient softmax approximation

Tavor Z Baharav, Daniel Kang, Colin Sullivan, Mo Tiwari, Eric Luxenberg, David Tse, and Mert Pilanci. Adaptive sampling for efficient softmax approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[10]

Sharp finite-sample concentration of independent variables

Akshay Balsubramani. Sharp finite-sample concentration of independent variables. arXiv preprint arXiv:2008.13293, 2020

arXiv 2008
[11]

Entropy, concentration, and learning: a statistical mechanics primer

Akshay Balsubramani. Entropy, concentration, and learning: a statistical mechanics primer. arXiv preprint arXiv:2409.18630, 2024

arXiv 2024
[12]

Introduction to smooth ergodic theory , volume 231

Luis Barreira and Yakov Pesin. Introduction to smooth ergodic theory , volume 231. American Mathematical Society, 2023

2023
[13]

Upper and lower bounds on the renyi dimensions and the uniformity of multifractals

Christian Beck. Upper and lower bounds on the renyi dimensions and the uniformity of multifractals. Physica D: Nonlinear Phenomena, 41(1):67–78, 1990

1990
[14]

On a measure of divergence between two statistical populations defined by their probability distribution

Anil Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distribution. Bulletin of the Calcutta Mathematical Society , 35:99–110, 1943

1943
[15]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Moham- mad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML), 2023

2023
[16]

Variational representations and neural network estimation of rényi divergences

Jeremiah Birrell, Paul Dupuis, Markos A Katsoulakis, Luc Rey-Bellet, and Jie Wang. Variational representations and neural network estimation of rényi divergences. SIAM Journal on Mathematics of Data Science, 3(4):1093–1116, 2021

2021
[17]

Variational inference: A review for statisticians

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017

2017
[18]

Concentration Inequalities: A Nonasymptotic Theory of Inde- pendence

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Inde- pendence. Oxford University Press, 2013

2013
[19]

On rényi entropies and their applications to guessing attacks in cryptography

Serdar Boztas. On rényi entropies and their applications to guessing attacks in cryptography. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences , 97(12):2542–2548, 2014. 28

2014
[20]

Conditional Rényi divergence saddlepoint and the maximization of α-mutual informa- tion

Cong Cai and Sergio Verdú. Conditional Rényi divergence saddlepoint and the maximization of α-mutual informa- tion. Entropy, 21(10):969, 2019. doi: 10.3390/e21100969

work page doi:10.3390/e21100969 2019
[21]

A coding theorem and rényi’s entropy

L Lorne Campbell. A coding theorem and rényi’s entropy. Information and control, 8(4):423–429, 1965

1965
[22]

Definition of entropy by means of a coding problem

LL Campbell. Definition of entropy by means of a coding problem. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 6(2):113–118, 1966

1966
[23]

Eugenio Clerico, Tyler Farghly, George Deligiannidis, Benjamin Guedj, and Arnaud Doucet

Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning , volume 56 of Institute of Mathematical Statistics Lecture Notes–Monograph Series . Institute of Mathematical Statistics, Beachwood, OH, 2007. doi: 10.1214/074921707000000391

work page doi:10.1214/074921707000000391 2007
[24]

Micro-canonical cascades and random homeomorphisms

Xinxin Chen, Yong Han, Yanqi Qiu, and Zipeng Wang. Micro-canonical cascades and random homeomorphisms. arXiv preprint arXiv:2505.16405, 2025

arXiv 2025
[25]

A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, pages 493–507, 1952

Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, pages 493–507, 1952

1952
[26]

Multifractal formalism derived from thermodynamics for general dynamical systems.Electronic Research Announcements, 17:1–11, 2010

Vaughn Climenhaga. Multifractal formalism derived from thermodynamics for general dynamical systems.Electronic Research Announcements, 17:1–11, 2010

2010
[27]

Universal randomized guessing subject to distortion

Asaf Cohen and Neri Merhav. Universal randomized guessing subject to distortion. IEEE Transactions on Information Theory, 68(12):7714–7734, 2022

2022
[28]

I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3 (1):146–158, 1975

1975
[29]

I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Annals of Probability , 12(3): 768–793, 1984

1984
[30]

Csiszár and F

I. Csiszár and F. Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, June 2003. ISSN 0018-9448

2003
[31]

Generalized cutoff rates and rényi’s information measures

Imre Csiszár. Generalized cutoff rates and rényi’s information measures. IEEE Transactions on information theory , 41(1):26–34, 2002

2002
[32]

Information theory and statistics: A tutorial

Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Commu- nications and Information Theory, 1(4):417–528, 2004

2004
[33]

H. E. Daniels. Saddlepoint approximations in statistics. The Annals of Mathematical Statistics , 25(4):631–650, 1954. doi: 10.1214/aoms/1177728652

work page doi:10.1214/aoms/1177728652 1954
[34]

Large deviations techniques and applications

Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications. Stochastic Modelling and Applied Probability, 2010

2010
[35]

Large deviations for a general class of random vectors

Richard S Ellis. Large deviations for a general class of random vectors. The Annals of Probability, 12(1):1–12, 1984

1984
[36]

On a new law of large numbers

Paul Erdös and Alfred Rényi. On a new law of large numbers. Journal d’Analyse Mathématique, 23(1):103–111, 1970

1970
[37]

Detecting hallucinations in large language models using semantic entropy

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024

2024
[38]

Adaptive game playing using multiplicative weights

Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999

1999
[39]

Laurent, Anqi Shao, Maria Del Mar Alvarez-T orres, Tianji Yu, Jimin Tan, Jiayu Su, Romella Sagatelian, Adolfo A

Xi Fu, Shentong Mo, Alejandro Buendia, Anouchka P. Laurent, Anqi Shao, Maria Del Mar Alvarez-T orres, Tianji Yu, Jimin Tan, Jiayu Su, Romella Sagatelian, Adolfo A. Ferrando, Alberto Ciccia, Yanyan Lan, David M. Owens, T eresa Palomero, Eric P. Xing, and Raul Rabadan. A foundation model of transcription across human cell types. Nature, 637(8047):965–973, 2...

work page doi:10.1038/s41586-024-08391-z 2025
[40]

On large deviations from the invariant measure

Jürgen Gärtner. On large deviations from the invariant measure. Theory of Probability & Its Applications, 22(1):24–39, 1977

1977
[41]

A characterization theorem for externally bayesian groups

Christian Genest. A characterization theorem for externally bayesian groups. The Annals of Statistics, 12(3):1100–1105, 1984

1984
[42]

Combining probability distributions: A critique and an annotated bibliography

Christian Genest and James V Zidek. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1(1):114–135, 1986

1986
[43]

Strict convexity (“continuity”) of the pressure in lattice systems

Robert B Griffiths and David Ruelle. Strict convexity (“continuity”) of the pressure in lattice systems. Communications in Mathematical Physics, 23(3):169–175, 1971

1971
[44]

The minimum description length principle

Peter D Grünwald. The minimum description length principle. MIT Press, 2007

2007
[45]

Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory

Peter D Grünwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, 2004

2004
[46]

Regularized rényi divergence minimization through bregman proximal gradient algorithms

Thomas Guilmeau, Emilie Chouzenoux, and Víctor Elvira. Regularized rényi divergence minimization through bregman proximal gradient algorithms. Journal of Machine Learning Research, 26(157):1–56, 2025

2025
[47]

Training products of experts by minimizing contrastive divergence

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002

2002
[48]

spaCy: Industrial-strength natural language processing in python

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in python, 2020. https://spacy.io

2020
[49]

Justification of logarithmic loss via the benefit of side information

Jiantao Jiao, Thomas A Courtade, Kartik Venkat, and Tsachy Weissman. Justification of logarithmic loss via the benefit of side information. IEEE Transactions on Information Theory, 61(10):5357–5365, 2015

2015
[50]

Axiomatic characterization of the directed divergences and their linear combinations

R Johnson. Axiomatic characterization of the directed divergences and their linear combinations. IEEE Transactions on Information Theory, 25(6):709–716, 1979

1979
[51]

Positive martingales and random measures

Jean-Pierre Kahane. Positive martingales and random measures. Chinese Annals of Mathematics Series B, 8(1):1–12, 1987

1987
[52]

Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

Pith/arXiv arXiv 2013
[53]

Fusion of probability density functions

Günther Koliander, Yousef El-Laham, Petar M Djurić, and Franz Hlawatsch. Fusion of probability density functions. Proceedings of the IEEE, 110(4):404–453, 2022

2022
[54]

An axiomatic theory of fairness in network resource allocation

Tian Lan, David Kao, Mung Chiang, and Ashutosh Sabharwal. An axiomatic theory of fairness in network resource allocation. In Proceedings of the 29th conference on Information communications, pages 1343–1351, 2010

2010
[55]

Mixture of experts meets prompt-based continual learning

Minh Le, An Nguyen, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Van Ngo, and Nhat Ho. Mixture of experts meets prompt-based continual learning. Advances in Neural Information Processing Systems, 37:119025–119062, 2024

2024
[56]

Contrastive decoding: Open-ended text generation as optimization

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Mike Lewis, and Luke Zettlemoyer. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023. arXiv:2210.15097

arXiv 2023
[57]

On divergences and informations in statistics and information theory

Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006

2006
[58]

Divergence measures based on the shannon entropy

Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991

1991
[59]

Saddle point approximation for the distribution of the sum of independent random variables

Robert Lugannani and Stephen Rice. Saddle point approximation for the distribution of the sum of independent random variables. Advances in Applied Probability, 12(2):475–490, 1980. doi: 10.2307/1426607

work page doi:10.2307/1426607 1980
[60]

Algorithmic fractal dimensions in geometric measure theory

Jack H Lutz and Elvira Mayordomo. Algorithmic fractal dimensions in geometric measure theory. In Handbook of Computability and Complexity in Analysis, pages 271–302. Springer, 2021.

2021
[61]

Gabriella E. Martyn, Michael T. Montgomery, Hank Jones, Katherine Guo, Benjamin R. Doughty, Johannes Linder, Deepa Bisht, Fan Xia, Xiangmeng S. Cai, Ziwei Chen, Kelly Cochran, Kathryn A. Lawrence, Glen Munson, Anusri Pampari, Charles P. Fulco, Nidhi Sahni, David R. Kelley, Eric S. Lander, Anshul Kundaje, and Jesse M. Engreitz. Rewriting regulatory DNA...

work page doi:10.1016/j.cell.2025.03.034 2025
[62]

On the notion of affinity of several distributions and some of its applications

Kameo Matusita. On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, 19(1):181–192, 1967

1967
[63]

Some pac-bayesian theorems

David A McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh annual conference on Computational learning theory, pages 230–234, 1998

1998
[64]

Generalized q-dimensions of measures on non-autonomous conformal sets

Jun Jie Miao and Tianrui Wang. Generalized q-dimensions of measures on non-autonomous conformal sets. arXiv preprint arXiv:2512.19771, 2025

arXiv 2025
[65]

Fair end-to-end window-based congestion control

Jeonghoon Mo and Jean Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on networking, 8(5):556–567, 2002

2002
[66]

From blackwell dominance in large samples to rényi divergences and back again

Xiaosheng Mu, Luciano Pomatto, Philipp Strack, and Omer Tamuz. From blackwell dominance in large samples to rényi divergences and back again. Econometrica, 89(1):475–506, 2021

2021
[67]

An information-geometric characterization of chernoff information

Frank Nielsen. An information-geometric characterization of chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013

2013
[68]

Kernel language entropy: Fine-grained uncertainty quantification for LLMs

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[69]

Perspective on physical interpretations of rényi entropy in statistical mechanics

Misaki Ozawa and Nina Javerzat. Perspective on physical interpretations of rényi entropy in statistical mechanics. Europhysics Letters, 147(1):11001, 2024

2024
[70]

Characteristic lyapunov exponents and smooth ergodic theory

Ya B Pesin. Characteristic lyapunov exponents and smooth ergodic theory. Russian Mathematical Surveys, 32(4):55, 1977

1977
[71]

Repulsive mixtures

Francesca Petralia, Vinayak Rao, and David Dunson. Repulsive mixtures. Advances in neural information processing systems, 25, 2012

2012
[72]

Information theory: From coding to learning

Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning. Cambridge University Press, 2025

2025
[73]

Poincaré on gibbs and on probability in statistical mechanics

Bruce D Popp. Poincaré on gibbs and on probability in statistical mechanics. arXiv preprint arXiv:2505.12168, 2025

arXiv 2025
[74]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI technical report, 2019

2019
[75]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023
[76]

On the dimension and entropy of probability distributions

Alfréd Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1):193–215, 1959

1959
[77]

Dimension, entropy and information

Alfréd Rényi. Dimension, entropy and information. In Trans. 2nd Prague Conf. Information Theory, pages 545–556, 1960

1960
[78]

On measures of entropy and information

Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, volume 4, pages 547–562. University of California Press, 1961

1961
[79]

On the foundations of information theory

Alfréd Rényi. On the foundations of information theory. Revue de l’Institut International de Statistique, pages 1–14, 1965.

1965
[80]

On the probability of large deviations of random magnitudes

Ivan Nikolaevich Sanov. On the probability of large deviations of random magnitudes. Matematicheskii Sbornik, 84 (1):11–44, 1957. Translation at https://repository.lib.ncsu.edu/items/8f909775-ba1b-4874-acc2-362a8221edb0

1957
