pith. machine review for the scientific record.

arxiv: 2604.27743 · v2 · submitted 2026-04-30 · 💻 cs.IT · cs.AI · cs.LG · math.IT


Why Self-Supervised Encoders Want to Be Normal


Pith reviewed 2026-05-07 05:27 UTC · model grok-4.3

classification 💻 cs.IT · cs.AI · cs.LG · math.IT

keywords self-supervised learning · information bottleneck · isotropic Gaussian · predictive manifold · rate-distortion · representation collapse · joint-embedding architectures · regularization

The pith

Self-supervised encoders prefer normal distributions because isotropic Gaussian latents are the optimum of the information bottleneck objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that self-supervised encoders converge to normal distributions as a direct outcome of optimizing the information bottleneck. When the bottleneck objective is reframed as a rate-distortion problem on the manifold of predictive distributions, the optimal target-neutral representations become isotropic Gaussians. This structure organizes latents as soft clusters of inputs with similar future predictions inside a simplex. A reader would care because the result explains why popular regularizers keep embeddings informative and high-entropy, and it offers a geometric account of representation collapse. The same view unifies many supervised and self-supervised losses under one principle.

Core claim

By recasting the Information Bottleneck objective as a rate-distortion problem over the predictive manifold, optimal target-neutral latent representations tend toward isotropic Gaussian states. These representations correspond to soft clustering of inputs that share similar predictive distributions, organized within a natural simplex structure. The framework unifies supervised and less-supervised objectives, supplies a principled account of common regularization schemes, and yields practical loss functions that approximate the desired structure while preventing collapse.
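
A compact way to see the recasting (notation reconstructed editorially, consistent with the distortion ε = I(X;Y|W) that appears in the paper's figures, not quoted from its equations): for the Markov chain Y–X–W, the IB Lagrangian splits into a rate term and a distortion term on the predictive manifold.

    % IB Lagrangian (Tishby et al. [40]): compress X into W, keep information about Y
    \min_{p(w \mid x)} \; I(X;W) - \beta\, I(W;Y)

    % Since I(W;Y) = I(X;Y) - I(X;Y \mid W) under Y--X--W, this equals, up to the
    % constant \beta I(X;Y),
    \min_{p(w \mid x)} \; \underbrace{I(X;W)}_{\text{rate}}
        + \beta\, \underbrace{I(X;Y \mid W)}_{\text{distortion } \varepsilon}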

What carries the argument

The recasting of the Information Bottleneck as a rate-distortion problem over the predictive manifold, which forces target-neutral latents into isotropic Gaussian states organized as soft clusters in a simplex.

If this is right

  • Latent representations function as soft clusters grouped by shared predictive distributions inside a simplex.
  • The same principle unifies supervised, self-supervised, and joint-embedding objectives under one geometric account.
  • Common regularization terms are justified as approximations that keep representations close to the optimal Gaussian states (a minimal sketch of such a regularizer follows this list).
  • Derived loss functions that enforce the simplex structure improve stability and prevent collapse on standard benchmarks.
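
As a concrete rendering of the regularization point above, a minimal sketch (editorial; not the paper's derived loss) of a variance-covariance penalty that holds batch embeddings near zero-mean, unit-variance, decorrelated statistics, i.e. near the isotropic Gaussian optimum:

    import torch

    def isotropy_regularizer(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
        """Penalize embeddings z of shape (batch, dim) for drifting away from
        isotropic-Gaussian-like batch statistics. A VICReg-flavored sketch."""
        z = z - z.mean(dim=0)                      # center each dimension
        std = torch.sqrt(z.var(dim=0) + eps)
        var_loss = torch.relu(1.0 - std).mean()    # hold per-dim variance near 1
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)                  # empirical covariance matrix
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_loss = (off_diag ** 2).sum() / d       # decorrelate dimensions
        return var_loss + cov_loss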

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures could be designed to operate directly on the predictive manifold rather than on raw latents, potentially improving sample efficiency.
  • The simplex view suggests that the effective dimensionality of useful representations is bounded by the number of distinct predictive behaviors in the data.
  • Rate-distortion tools from classical compression could be imported to derive new regularization schedules that explicitly control entropy.
  • Forcing a non-Gaussian prior in small-scale experiments would provide a direct test of whether performance drops exactly when the bottleneck constraint is violated (a hypothetical harness is sketched after this list).
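
The last item is directly actionable. A hypothetical harness (the Laplace target and all names here are ours, not the paper's): add a moment-matching penalty that pulls standardized latents toward a non-Gaussian prior, then track downstream accuracy as the penalty weight grows.

    import torch

    def laplace_prior_penalty(z: torch.Tensor) -> torch.Tensor:
        """Push standardized latents toward unit-variance Laplace statistics:
        E|z| = 1/sqrt(2) and excess kurtosis = 3. Editorial sketch only."""
        z = (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-6)
        abs_loss = (z.abs().mean(dim=0) - 2 ** -0.5).pow(2).mean()
        excess_kurt = (z ** 4).mean(dim=0) - 3.0   # 0 for a Gaussian, 3 for Laplace
        kurt_loss = (excess_kurt - 3.0).pow(2).mean()
        return abs_loss + kurt_loss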

Load-bearing premise

The information bottleneck, once rewritten as rate-distortion on the predictive manifold, must produce isotropic Gaussians for optimal target-neutral representations regardless of encoder architecture details.

What would settle it

A self-supervised encoder trained with an added constraint that forces non-Gaussian latents and still matches or exceeds standard performance on downstream tasks while preserving high mutual information with the input would contradict the necessity of the Gaussian preference.
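
Any such experiment needs a way to quantify how Gaussian the latents actually are. One option (editorial; see reference [16] below, and the test the SIGReg regularizer from LeJEPA [6] builds on) is the Epps-Pulley statistic applied to one-dimensional projections of the embeddings:

    import numpy as np

    def epps_pulley(y: np.ndarray) -> float:
        """Epps-Pulley (BHEP) normality statistic for a 1-D sample; large
        values indicate departure from Gaussianity. Closed form of
        n * integral |phi_n(t) - exp(-t^2/2)|^2 dPhi(t) after standardizing."""
        y = np.asarray(y, dtype=float)
        n = y.size
        y = (y - y.mean()) / y.std()               # standardize the sample
        diff = y[:, None] - y[None, :]
        term1 = np.exp(-0.5 * diff ** 2).sum() / n # pairwise kernel term
        term2 = np.sqrt(2.0) * np.exp(-0.25 * y ** 2).sum()
        return term1 - term2 + n / np.sqrt(3.0)

    # e.g. score latents along a random unit direction before and after
    # adding the non-Gaussian constraint.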

Figures

Figures reproduced from arXiv: 2604.27743 by Yuval Domb.

Figure 1. Overview from left to right: (a) the IB-optimal encoder; (b) soft clustering in the …
Figure 2. Soft clustering of the predictive manifold.
Figure 3. Example (26) inside P3. Blue points: M = {w1, w2, w3}, the predictive manifold. Dashed triangle: convex hull of M (visual aid). Grey dot: marginal p(Y) = (1/3, 1/3, 1/3).
Figure 4. The achievable region (light blue trapezoid) in the (…) plane.
Figure 5. Tangent to the IB curve at operating point (…).
Figure 6. The chain of transformations connecting the isotropic Gaussian to the flat Dirichlet on …
Figure 7. Continuous predictive manifold: p(Y|x) traces a loop through P3 as θ varies over [0, 2π).
Figure 8. Continuous experiment: Dirichlet means (blue) vs. ground-truth …
Figure 9. Continuous experiment: Information Plane, Rate–Distortion curve, and …
Figure 10. Discrete predictive manifold: 10 distinct …
Figure 11. Discrete CEB experiment: Dirichlet means vs. ground-truth for increasing …
Figure 12. Discrete CEB experiment: Information Plane, Rate–Distortion curve, and …
Figure 13. FashionMNIST: VIB (Gaussian, R^16) vs. CEB (Dirichlet, P10). Top-left: error rate vs. β. Top-right: information plane I(W;Y) vs. I(X;W) (CEB only). Bottom-left: rate–distortion curve I(X;Y|W) vs. I(X;W) (CEB only). Bottom-right: distortion ε = I(X;Y|W) vs. β (CEB only). CEB achieves higher accuracy at lower rate with no parametric assumptions beyond the dimension of W.
Figure 14. CEB ablation over simplex dimension K at β = 25. Accuracy saturates at K = |Y| = 10; further increasing K reduces distortion but does not improve classification.
original abstract

Self-supervised learning has achieved remarkable empirical success in learning robust representations without explicit labels, most recently demonstrated within the framework of Joint-Embedding Predictive Architectures (JEPA). However, a fundamental question remains: what analytical principles drive these encoders toward specific distributional states? In this paper, we demonstrate that the preference for normal distributions in self-supervised encoders is a direct consequence of the Information Bottleneck (IB) principle. By recasting the IB objective as a rate-distortion problem over the predictive manifold, we provide a theoretical basis for why optimal, target-neutral, latent representations should tend towards isotropic Gaussian states. Under this framework, we show that latent representations correspond to soft clustering of inputs sharing similar predictive distributions, organized within a natural simplex structure. This perspective unifies a wide range of existing supervised and less-supervised objectives and provides a principled explanation for commonly used regularization schemes. Furthermore, we derive practical loss objectives that approximate this structure and demonstrate their effectiveness on standard benchmarks. Ultimately, our framework offers a geometric lens to understanding representation collapse and it establishes a mathematical system for regularization strategies to be used to ensure high-entropy, informative embeddings in modern self-supervised models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the preference for isotropic Gaussian latent representations in self-supervised encoders (particularly JEPA-style models) is a direct consequence of the Information Bottleneck (IB) principle. By recasting the IB objective as a rate-distortion optimization problem over the manifold of predictive distributions, the authors argue that optimal target-neutral latents converge to isotropic Gaussians. They interpret these latents as a soft clustering of inputs with similar predictive behavior, naturally organized on a simplex structure. This framework unifies a range of supervised and self-supervised objectives, explains common regularization schemes, derives practical approximating losses, demonstrates their effectiveness on standard benchmarks, and offers a geometric perspective on representation collapse.

Significance. If the rate-distortion recasting is shown to be robust, the work supplies a principled information-theoretic explanation for the Gaussian bias observed in modern SSL and a geometric account of collapse, which could inform regularization design for high-entropy embeddings. The unification of objectives and derivation of practical losses are concrete strengths that could influence practice. The significance is reduced, however, by the absence of a demonstration that the Gaussian outcome is independent of the chosen manifold metric and distortion function; prior IB applications in representation learning have not uniformly produced the same conclusion, so the result risks being parameterization-specific rather than a universal consequence of IB.

major comments (2)
  1. [§3] §3 (Rate-Distortion Recasting of IB): the claim that isotropic Gaussians are the preferred states for target-neutral latents rests on an implicit Euclidean geometry of the predictive manifold together with quadratic distortion and differential entropy as the rate; the manuscript provides no proof or sensitivity analysis showing that the minimizer remains Gaussian when the distortion is replaced by, e.g., KL divergence or Wasserstein distance on the manifold. This assumption is load-bearing for the central assertion that the Gaussian preference is a 'direct consequence' of IB rather than an artifact of the chosen parameterization. [A numerical probe of this sensitivity is sketched after the minor comments.]
  2. [§4] §4 (Simplex Structure and Soft Clustering): the asserted 'natural simplex structure' for organizing soft clusters of predictive distributions is in tension with the unbounded support of isotropic Gaussians; the paper does not specify the embedding or high-dimensional limit that reconciles a fixed-support simplex geometry with Gaussian distributions, nor does it show that the Gaussian remains optimal once this embedding is made explicit. Without this clarification the derivation risks circularity with the chosen representation of the predictive manifold.
minor comments (2)
  1. [Abstract] The term 'target-neutral' is introduced in the abstract without definition; a one-sentence clarification in the introduction or §2 would improve accessibility.
  2. [§4] The unification of existing objectives (contrastive, predictive, etc.) is asserted but not accompanied by an explicit mapping table or equation-by-equation comparison; adding such a table in §4 would strengthen the unification claim without altering the central argument.
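
A cheap numerical probe of the first major comment (an editorial suggestion, not material from the manuscript): run the Blahut-Arimoto algorithm [10] on a discretized Gaussian source under different distortion matrices and compare the optimal reproduction distributions; if the optimum stops looking Gaussian under L1 distortion, the parameterization-dependence worry has teeth.

    import numpy as np

    def blahut_arimoto(p_x, d, beta, iters=500):
        """Blahut-Arimoto iteration for rate-distortion: source p_x over a
        grid, distortion matrix d[x, w], trade-off beta. Returns the optimal
        output marginal q(w) and channel q(w|x). Swapping d tests how much
        the optimum depends on the distortion choice."""
        n_w = d.shape[1]
        q_w = np.full(n_w, 1.0 / n_w)                 # initial output marginal
        for _ in range(iters):
            q_wx = q_w[None, :] * np.exp(-beta * d)   # q(w|x) ∝ q(w) exp(-βd)
            q_wx /= q_wx.sum(axis=1, keepdims=True)
            q_w = p_x @ q_wx                          # update output marginal
        return q_w, q_wx

    # quadratic vs. absolute distortion on the same grid:
    xs = np.linspace(-3.0, 3.0, 61)
    p_x = np.exp(-0.5 * xs ** 2); p_x /= p_x.sum()
    d_quad = (xs[:, None] - xs[None, :]) ** 2
    d_abs = np.abs(xs[:, None] - xs[None, :])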

Simulated Author's Rebuttal

2 responses · 1 unresolved

We appreciate the referee's detailed feedback, which highlights important aspects of our rate-distortion formulation and geometric interpretation. We respond to each major comment below and will incorporate revisions to strengthen the manuscript's clarity and address the concerns about parameterization dependence.

point-by-point responses
  1. Referee: [§3] §3 (Rate-Distortion Recasting of IB): the claim that isotropic Gaussians are the preferred states for target-neutral latents rests on an implicit Euclidean geometry of the predictive manifold together with quadratic distortion and differential entropy as the rate; the manuscript provides no proof or sensitivity analysis showing that the minimizer remains Gaussian when the distortion is replaced by, e.g., KL divergence or Wasserstein distance on the manifold. This assumption is load-bearing for the central assertion that the Gaussian preference is a 'direct consequence' of IB rather than an artifact of the chosen parameterization.

    Authors: We thank the referee for pointing out this important qualification. The recasting in §3 indeed employs the standard quadratic distortion measure, which, together with differential entropy, yields the well-known Gaussian rate-distortion function. This choice aligns with the L2-based predictive losses commonly used in JEPA-style models. We acknowledge that the manuscript does not include a sensitivity analysis for alternative distortions such as KL or Wasserstein. In the revision, we will add a new subsection discussing the role of the distortion function, explicitly stating that the isotropic Gaussian solution is derived under quadratic distortion, and noting that other metrics may lead to different optimal distributions (e.g., Laplace for L1). We will also clarify that our claim of a 'direct consequence' is within the context of this standard rate-distortion setup for continuous latent spaces. This revision will make the assumptions transparent and mitigate the risk of the result appearing parameterization-specific. [The maximum-entropy argument behind this pairing is sketched after these responses.] revision: yes

  2. Referee: [§4] §4 (Simplex Structure and Soft Clustering): the asserted 'natural simplex structure' for organizing soft clusters of predictive distributions is in tension with the unbounded support of isotropic Gaussians; the paper does not specify the embedding or high-dimensional limit that reconciles a fixed-support simplex geometry with Gaussian distributions, nor does it show that the Gaussian remains optimal once this embedding is made explicit. Without this clarification the derivation risks circularity with the chosen representation of the predictive manifold.

    Authors: We agree that the relationship between the simplex geometry of the predictive distributions and the Gaussian latents requires clearer exposition to avoid any appearance of circularity. The simplex structure refers to the convex combinations in the space of predictive distributions (i.e., the probability simplex over possible future states or targets), while the latent representations are modeled as isotropic Gaussians in the embedding space. The embedding is such that each Gaussian corresponds to a soft cluster center in the predictive manifold, with the means of the Gaussians lying within the convex hull (simplex) of the predictive modes. In high dimensions, the variance of the Gaussians allows for the unbounded support while the effective clustering is determined by the means. We will revise §4 to include an explicit description of this embedding, perhaps with a diagram or formal definition of how the Gaussian parameters map to points on the predictive simplex. Additionally, we will show that the optimality of the Gaussian form holds as long as the rate is measured by differential entropy and distortion by quadratic distance in the latent space, independent of the specific simplex embedding. This should resolve the tension and strengthen the geometric perspective. revision: yes
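
For completeness, the standard maximum-entropy argument behind the quadratic-Gaussian and L1-Laplace pairing invoked in response 1 (an editorial sketch, not quoted from the manuscript):

    % Among densities with a bounded second moment, differential entropy is
    % maximized by the Gaussian; with a bounded first absolute moment, by the
    % Laplace density:
    \arg\max_{p:\, \mathbb{E}|W|^2 \le \sigma^2} h(W) = \mathcal{N}(0, \sigma^2),
    \qquad
    \arg\max_{p:\, \mathbb{E}|W| \le b} h(W) = \mathrm{Laplace}(0, b).
    % An entropy-based rate paired with a quadratic (resp. L1) distortion
    % budget therefore selects Gaussian (resp. Laplace) optima.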

standing simulated objections (unresolved)
  • A general demonstration that the isotropic Gaussian minimizer is independent of the manifold metric and distortion function (e.g., for KL divergence or Wasserstein distance) is not provided in the current manuscript and would require additional theoretical development.

Circularity Check

0 steps flagged

No significant circularity; central claim builds on standard external IB principle without self-referential reduction or fitted inputs renamed as predictions.

full rationale

The abstract presents the Gaussian preference as a consequence of recasting the Information Bottleneck (a pre-existing principle from Tishby et al.) as a rate-distortion problem on the predictive manifold. No equations are supplied in the visible text that define the manifold or distortion measure in terms of the target Gaussian outcome, nor is there a self-citation to a uniqueness theorem authored by the same writer that would force the result. The simplex structure for soft clustering is described as a derived perspective rather than an input assumption. Because the derivation chain relies on an independent, externally established objective (IB) and does not exhibit any of the enumerated circular patterns (self-definition, fitted-input prediction, load-bearing self-citation, etc.) in the provided material, the paper remains self-contained against external benchmarks. Full equations would be needed to confirm the manifold metric is not chosen to guarantee isotropy, but nothing in the given text triggers a circularity flag.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is limited to the abstract; the paper relies on the standard Information Bottleneck principle and introduces a predictive manifold structure whose details are not visible. No free parameters or invented entities are explicitly listed in the abstract.

axioms (1)
  • domain assumption The Information Bottleneck objective can be recast as a rate-distortion problem over the predictive manifold
    This recasting is the key step that leads to the Gaussian conclusion as stated in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1284 out tokens · 47646 ms · 2026-05-07T05:27:17.178219+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 3 canonical work pages

  [1] Aqeel Abbas and Marcus Hutter. A scale-dependent notion of effective dimension. arXiv preprint arXiv:2001.10872, 2020.
  [2] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018.
  [3] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
  [4] Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.
  [5] Suguru Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.
  [6] Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025.
  [7] Toby Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, 1971.
  [8] Patrick Billingsley. Probability and Measure. Wiley, 3rd edition, 1995.
  [9] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  [10] Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
  [11] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
  [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In International Conference on Computer Vision, pages 9650–9660, 2021.
  [13] Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, 2005.
  [14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607, 2020.
  [15] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
  [16] Thomas W. Epps and Lawrence B. Pulley. A test for normality based on the empirical characteristic function. Biometrika, 70(3):723–726, 1983.
  [17] Kenneth Falconer. Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, 2nd edition, 2003.
  [18] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  [19] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
  [20] Ian Fischer. The conditional entropy bottleneck. Entropy, 22(9):999, 2020.
  [21] Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London A, 222:309–368, 1922.
  [22] Peter Gács and János Körner. Common information is far less than mutual information. Problems of Control and Information Theory, 2(2):149–162, 1973.
  [23] Ziv Goldfeld and Yury Polyanskiy. The information bottleneck problem and its applications in machine learning. IEEE Journal on Selected Areas in Information Theory, 1(1):99–129, 2020.
  [24] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, volume 17, 2004.
  [25] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Chen, Michal Valko, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020.
  [26] Peter Harremoës and Naftali Tishby. The information bottleneck revisited or how to choose a good distortion measure. In IEEE International Symposium on Information Theory, pages 566–570, 2007.
  [27] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
  [28] Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul Moon. Dirichlet variational autoencoder. Pattern Recognition, 107:107514, 2020.
  [29] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, volume 27, 2014.
  [30] Artemy Kolchinsky, Brendan D. Tracey, and Steven Van Kuyk. Nonlinear information bottleneck. Entropy, 21(12):1181, 2019.
  [31] Andrei N. Kolmogorov and Vladimir M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations, 17:277–364, 1961.
  [32] Yann LeCun. A path towards autonomous machine intelligence. Technical report, Meta AI, 2022.
  [33] Erich L. Lehmann and Henry Scheffé. Completeness, similar regions, and unbiased estimation: Part I. Sankhyā, 10(4):305–340, 1950.
  [34] Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2021.
  [35] Andrew M. Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D. Tracey, and David D. Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
  [36] Claude E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, 7(4):142–163, 1959.
  [37] Ravid Shwartz-Ziv and Yann LeCun. To compress or not to compress—self-supervised learning and information theory: A review. Entropy, 26(3):252, 2024.
  [38] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  [39] DJ Strouse and David J. Schwab. The deterministic information bottleneck. Neural Computation, 29(6):1611–1630, 2017.
  [40] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, 1999.
  [41] Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In International Conference on Learning Representations, 2020.
  [42] Jesper E. van Engelen and Holger H. Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.
  [43] Hans S. Witsenhausen and Aaron D. Wyner. A conditional entropy bound for a pair of discrete random variables. IEEE Transactions on Information Theory, 21(5):493–501, 1975.
  [44] Aaron D. Wyner and Jacob Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory, 22(1):1–10, 1976.
  [45] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, 2017.