arxiv: 2405.05097 · v8 · pith:IIK37U7Snew · submitted 2024-05-08 · 💻 cs.LG · stat.ML

Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional propagation of values and densities

Pith reviewed 2026-05-24 00:54 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords joint distribution neurons · hierarchical correlation reconstruction · bidirectional propagation · Kolmogorov-Arnold networks · moment propagation · local training · information bottleneck · neural network architecture

The pith

Joint distribution neurons model local densities to enable bidirectional propagation, moment-based uncertainty handling, and local training alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes extending KAN-style neurons with an explicit model of the joint distribution over their inputs, written as a linear combination of basis functions over the unit hypercube. This model supports substituting observed values to recover conditional expectations or distributions for the remaining variables, propagating vectors of moments such as mean and variance, and training through direct fitting, tensor methods, or an information-bottleneck objective. A sympathetic reader would care because the construction directly targets three gaps between current artificial networks and biological ones: unidirectional flow, deterministic activation, and global back-propagation.

Core claim

Neurons containing the joint-density model ρ(x) = sum a_j f_j(x) for x in [0,1]^d allow repair of missing inputs by conditional evaluation, propagate distributions via moment vectors, and admit local training procedures including direct optimization and information-bottleneck updates, while remaining compatible with existing architectures such as transformers.

What carries the argument

The joint distribution representation ρ(x) = sum_{j in B} a_j f_j(x) that encodes correlations among inputs and supplies conditional values or moments on demand.
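
A minimal sketch of this representation for d = 2 may make it concrete. It assumes the polynomial product basis mentioned in the Figure 7 caption, taken here to be orthonormal shifted Legendre polynomials on [0, 1], and it estimates each coefficient as a sample mean of basis products, which is the least-squares choice for an orthonormal basis. The function names (`basis`, `fit_coefficients`, `density`) and the synthetic data are hypothetical, and the toy data skips the normalization of variables to nearly uniform marginals that the Figure 2 caption describes.

```python
import numpy as np

def basis(x, m=4):
    """First m orthonormal shifted-Legendre polynomials on [0, 1]; shape (m, n)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    polys = [
        np.ones_like(x),
        np.sqrt(3) * (2 * x - 1),
        np.sqrt(5) * (6 * x**2 - 6 * x + 1),
        np.sqrt(7) * (20 * x**3 - 30 * x**2 + 12 * x - 1),
    ]
    return np.stack(polys[:m])

def fit_coefficients(x, y, m=4):
    """Assumed estimator: a[i, j] is the sample mean of f_i(x) * f_j(y)."""
    fx, fy = basis(x, m), basis(y, m)
    return fx @ fy.T / len(x)

def density(a, x, y):
    """Evaluate rho(x, y) = sum_ij a[i, j] f_i(x) f_j(y) at paired points."""
    m = a.shape[0]
    return np.einsum("ij,in,jn->n", a, basis(x, m), basis(y, m))

# Synthetic, positively correlated toy data on [0, 1]^2 (illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(size=5000)
y = np.clip(x + 0.1 * rng.normal(size=5000), 0.0, 1.0)

a = fit_coefficients(x, y)
print(np.round(a, 2))  # a[0, 0] is the normalization term, a[1, 1] tracks correlation
```

Under these assumptions `a[0, 0]` equals 1 by construction, so the modeled density integrates to 1. Nothing in the sketch forces the density to stay non-negative, which is one of the gaps the referee report raises.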

If this is right

  • Inputs can be repaired on the fly by solving for the conditional distribution given the observed coordinates.
  • Uncertainty can be propagated forward by carrying vectors of moments rather than single point estimates.
  • Training rules other than back-propagation become available, including direct fitting of the coefficients a_j and local information-bottleneck objectives.
  • The same representation can replace softmax layers in embedding models by treating learned features as mixed moments of an underlying joint density.
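
The first two consequences above can be sketched for d = 2 as follows. The sketch again assumes an orthonormal shifted-Legendre product basis on [0, 1]; the coefficient matrix `a` is a hand-picked toy example rather than anything from the paper, and `conditional_moments` is a hypothetical helper name. Substituting an observed value for one variable and normalizing gives a conditional density for the other, from which mean and variance are read off by Gauss-Legendre quadrature. Transposing `a` swaps the propagation direction, in the spirit of the "permuting indexes" remark in the Figure 2 caption.

```python
import numpy as np

def basis(x, m=3):
    """First m orthonormal shifted-Legendre polynomials on [0, 1]; shape (m, n)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    polys = [
        np.ones_like(x),
        np.sqrt(3) * (2 * x - 1),
        np.sqrt(5) * (6 * x**2 - 6 * x + 1),
    ]
    return np.stack(polys[:m])

# Gauss-Legendre nodes and weights mapped from [-1, 1] to [0, 1].
_t, _w = np.polynomial.legendre.leggauss(8)
NODES, WEIGHTS = (_t + 1) / 2, _w / 2

def conditional_moments(a, x_obs):
    """Mean and variance of y given x = x_obs for rho = sum_ij a[i,j] f_i(x) f_j(y)."""
    m = a.shape[0]
    c = basis(x_obs, m)[:, 0] @ a        # c[j] = sum_i a[i, j] * f_i(x_obs)
    cond = (c @ basis(NODES, m)) / c[0]  # rho(y | x_obs) at the quadrature nodes
    mean = np.sum(WEIGHTS * NODES * cond)
    var = np.sum(WEIGHTS * (NODES - mean) ** 2 * cond)
    return mean, var

# Toy coefficients (assumption): uniform marginals plus one correlation term.
a = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.3, 0.0],
              [0.0, 0.0, 0.0]])

mean_y, var_y = conditional_moments(a, 0.9)    # propagate x -> y
mean_x, var_x = conditional_moments(a.T, 0.1)  # propagate y -> x via transpose
print(mean_y, var_y, mean_x, var_x)
```

The division by `c[0]` is the normalization step: with this basis, integrating the substituted density over the free variable leaves only the zero-index term.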

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such neurons could support decentralized or continual learning scenarios where only local statistics are updated.
  • Interpreting transformer features as moments suggests a route to uncertainty-aware attention mechanisms.
  • The approach opens a concrete path for testing whether explicit joint-density modeling improves robustness on tasks that reward risk sensitivity.

Load-bearing premise

The joint distribution model can be trained and evaluated at practical cost while preserving accuracy comparable to standard layers.

What would settle it

A controlled benchmark in which networks built from these neurons require substantially more parameters or training time than MLPs or KANs to reach the same test accuracy on a standard classification or regression task.

Figures

Figures reproduced from arXiv: 2405.05097 by Jarek Duda.

Figure 1: uni-directional vs bi-directional propagation of biological axons [2], working only on values vs also on distributions - observed in animals e.g. as risk avoidance [3], and finally BNNs need local training approaches like looking the most promising: information bottleneck ([5], [8], [9]). (Footnote 1: https://www.newscientist.com/article/2517389-human-brain-cells-on-a-chip-learned-to-play-doom-in-a-week/)
Figure 2: Basic formulas and example for d = 2 variables HCR neuron, using convenient variable normalization to nearly uniform in [0, 1]. Neuron contains matrix of moments: $a_{ij}$ (generally order d tensor), allowing to propagate in various directions by substituting some variables and normalizing to get estimated conditional density for the remaining - just permuting indexes to change propagation direction. For value …
Figure 3: Summary of differences between artificial (ANN) and bi…
Figure 5: The proposed HCR neuron and neural network (HCRN, HCRNN)
Figure 4: Simple 2/3D examples from HCRNN Wolfram notebook of propagation…
Figure 6: 2D example comparison of local basis KDE (kernel density estimation) vs global basis HCR (available code in HCR Wolfram notebook) modelling joint density for dataset as points shown on the right. Assuming ρ = 1 trivial joint density, we would get 0 log-likelihood evaluation (mean lg(ρ(x)) over dataset). Training on a randomly chosen subset and calculating log-likelihood on the remaining subset (cross-valid…
Figure 7: Top: Visualized part of HCR polynomial [0, 1] basis in d = 1 dimension and $f_\mathbf{j}(\mathbf{x}) = \prod_{i=1}^{d} f_{j_i}(x_i)$ product bases for d = 2, 3. E.g. for d = 3 the assumed joint density becomes $\rho(x, y, z) = \sum_{ijk} a_{ijk} f_i(x) f_j(y) f_k(z)$. As $f_0 = 1$, zero index in $a_{ijk}$ means independence from given variable, hence $a_{000} = 1$ corresponds to normalization, $a_{i00}, a_{0i0}, a_{00i}$ for i ≥ 1 describe marginal distributions through i-th momen…
Figure 8: KAN-like example with code from HCRNN Wolfram notebook: trying…
Figure 10: Example of direct prediction of conditional distribution from [22]
Figure 12: Independence test HSIC vs HCR comparison from [25] - we independently generate 2 data samples from bimodal distribution and introduce dependence by rotating it 0, 1, 2, 3, 4, 5 degrees (top). In HCR we model their joint distribution as polynomial - there are shown such density models and their |B| = 4 × 4 = 16 coefficients for m = 4. To distinguish signal from noise, these moments can be normalized to N(0…
Figure 14: Top: even not having ground truth for properties, we might be able to enforce network to work on their probability densities e.g. through replacement of softmax, un-embedding. There is now mainly used softmax for values (top row). HCR approach allows to replace one or both with density model for each property. There are also dependencies e.g. correlations between properties, suggesting to also include mix…
Figure 13: Top: embeddings are the basic tools of modern neural networks like transformers, representing various objects e.g. words as vectors, of parameters hopefully corresponding to real properties, like age. However, e.g. word "adult" represents much larger age variance than "toddler" - single feature like energy in softmax seems insufficient to describe this property, requiring entire probability distributions,…
Figure 15: Omnidirectional HCR neuron proposed in [10] - getting any…
Original abstract

Recently a million of biological neurons (BNN) has turned out better from modern RL methods in playing Pong~\cite{RL}, reminding they are still qualitatively superior e.g. in learning, flexibility and robustness - suggesting to try to improve current artificial e.g. MLP/KAN for better agreement with biological. There is proposed extension of KAN approach to neurons containing model of local joint distribution: $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ for $\mathbf{x} \in [0,1]^d$, adding interpretation and information flow control to KAN, and allowing to gradually add missing 3 basic properties of biological: 1) biological axons propagate in both directions~\cite{axon}, while current artificial are focused on unidirectional propagation - joint distribution neurons can repair by substituting some variables to get conditional values/distributions for the remaining. 2) Animals show risk avoidance~\cite{risk} requiring to process variance, and generally real world rather needs probabilistic models - the proposed can predict and propagate also distributions as vectors of moments: (expected value, variance) or higher. 3) biological neurons require local training, and beside backpropagation, the proposed allows many additional ways, like direct training, through tensor decomposition, or finally local and promising: information bottleneck. Proposed approach is very general, can be also used as extension of softmax in embeddings of e.g. transformer or JEPA, suggesting interpretation that features are mixed moments of joint density of real-world properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript proposes joint distribution neurons as an extension of Kolmogorov-Arnold Networks, in which each neuron models a local joint distribution via the linear expansion ρ(x)=∑_{j∈B} a_j f_j(x) for x∈[0,1]^d. The central claim is that this form supplies three missing biological properties: (1) bidirectional propagation obtained by variable substitution to produce conditionals, (2) propagation of full distributions represented as moment vectors (mean, variance, …), and (3) local training routes including direct fitting, tensor decomposition, and the information bottleneck. The same construction is suggested as a drop-in replacement for softmax layers in transformers.

Significance. If the functional form could be equipped with concrete, tractable basis functions and training procedures that realize the three listed properties at scale, the work would supply a principled probabilistic primitive that unifies interpretation, uncertainty propagation, and locality of learning—potentially improving robustness and sample efficiency over standard MLPs or KANs. The absence of any such concrete realization, however, leaves the significance prospective rather than demonstrated.

major comments (4)
  1. [Abstract] Abstract: the claim that substitution of variables directly yields conditional distributions omits the marginalization integrals required for normalization; without an explicit product or separable structure on the unspecified f_j, these integrals are intractable for d>3 and therefore load-bearing for the bidirectional-propagation claim.
  2. [Abstract] Abstract: no choice of basis functions f_j, multi-index set B, non-negativity constraint, or normalization procedure for the coefficients a_j is supplied, rendering the three biological properties formal possibilities rather than demonstrated capabilities of the given expansion.
  3. [Abstract] Abstract: the information-bottleneck training route is asserted to be “local and promising,” yet no algorithm, objective, or complexity bound is derived that would show how the bottleneck can be optimized using only the linear coefficients a_j and the (unspecified) f_j.
  4. [Abstract] Abstract: the manuscript contains neither derivations, pseudocode, complexity analysis, nor any empirical result that would substantiate that the proposed neuron can be trained or evaluated at practical cost while preserving the claimed moment-propagation and conditioning properties.
minor comments (2)
  1. [Abstract] Abstract: grammatical phrasing “a million of biological neurons” and “the proposed can predict” should be corrected.
  2. [Abstract] Abstract: citation markers (e.g., ~cite{RL}, ~cite{axon}) appear without an accompanying reference list or context.
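
For major comment 1, it may help to write out the one case in which the marginalization integrals collapse. This is a worked sketch for d = 2, not a derivation taken from the manuscript. It assumes the product basis with $f_0 = 1$ that the Figure 7 caption describes, plus orthonormality of the one-dimensional basis on [0, 1], which is an additional assumption here.

```latex
% Assumption: f_0 = 1 and \int_0^1 f_i(y) f_j(y)\,dy = \delta_{ij}, hence
\int_0^1 f_j(y)\,dy = \int_0^1 f_j(y)\, f_0(y)\,dy = \delta_{j0}.

% Joint model for d = 2:
\rho(x, y) = \sum_{i,j} a_{ij}\, f_i(x)\, f_j(y).

% Substituting an observed x and integrating over y keeps only the j = 0 terms:
\int_0^1 \rho(x, y)\,dy = \sum_{i} a_{i0}\, f_i(x).

% The normalized conditional density is therefore a ratio of finite sums:
\rho(y \mid x) = \frac{\sum_{j} f_j(y) \sum_{i} a_{ij}\, f_i(x)}{\sum_{i} a_{i0}\, f_i(x)}.
```

Under these assumptions no numerical integration is needed, so the cost concern moves from the integrals to the size of the index set B, which for a full tensor of m basis functions per variable grows as $m^d$. The ratio is also only a valid density where the numerator and denominator stay positive, which the linear expansion alone does not guarantee.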

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive critique. The manuscript is a concise conceptual proposal introducing the joint-distribution neuron form and arguing that it formally enables three biological properties. We respond point-by-point below, acknowledging where the current text is limited to the general expansion and where concrete realizations remain future work.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that substitution of variables directly yields conditional distributions omits the marginalization integrals required for normalization; without an explicit product or separable structure on the unspecified f_j, these integrals are intractable for d>3 and therefore load-bearing for the bidirectional-propagation claim.

    Authors: We agree that obtaining a properly normalized conditional from the joint expansion generally requires marginalization integrals. The manuscript states that substitution yields conditionals, but does not claim this is automatic for arbitrary bases; the intent is that, once a concrete basis admitting closed-form or efficient marginals is chosen, the same linear coefficients allow both forward and backward propagation. The current text leaves the required structure on f_j implicit, which is a limitation of the presentation.
    revision: no

  2. Referee: [Abstract] Abstract: no choice of basis functions f_j, multi-index set B, non-negativity constraint, or normalization procedure for the coefficients a_j is supplied, rendering the three biological properties formal possibilities rather than demonstrated capabilities of the given expansion.

    Authors: The manuscript deliberately presents the most general linear expansion that still permits the three listed operations (variable substitution, moment-vector propagation, and local coefficient updates). Specific bases (e.g., multivariate polynomials or wavelets on [0,1]^d), non-negativity constraints, and normalization schemes are indeed omitted because the paper’s scope is to establish the functional form and its qualitative advantages over standard KAN neurons. Concrete instantiations are required for implementation and are noted as future work.
    revision: no

  3. Referee: [Abstract] Abstract: the information-bottleneck training route is asserted to be “local and promising,” yet no algorithm, objective, or complexity bound is derived that would show how the bottleneck can be optimized using only the linear coefficients a_j and the (unspecified) f_j.

    Authors: The claim is that the information-bottleneck objective can be expressed directly in terms of the coefficients a_j once the basis is fixed, because the modeled density is linear in those coefficients; this would in principle allow a local update without back-propagation through the rest of the network. No explicit algorithm or complexity analysis is supplied, as the manuscript only identifies the route as conceptually local. Deriving a practical optimizer is left for subsequent development.
    revision: no

  4. Referee: [Abstract] Abstract: the manuscript contains neither derivations, pseudocode, complexity analysis, nor any empirical result that would substantiate that the proposed neuron can be trained or evaluated at practical cost while preserving the claimed moment-propagation and conditioning properties.

    Authors: The manuscript is a short conceptual note whose contribution is the identification of the linear joint-density expansion and the three formal properties it enables. It therefore contains no empirical results, pseudocode, or complexity bounds. We accept that demonstrating practical cost and preservation of the properties requires concrete bases, training procedures, and experiments, none of which are present.
    revision: no

Circularity Check

0 steps flagged

No circularity: forward architectural proposal with independent claims

full rationale

The manuscript defines the joint density model ρ(x)=∑_{j∈B} a_j f_j(x) directly as an extension of KAN and then enumerates three biological properties (bidirectional repair via substitution, moment-vector propagation, and multiple local training routes) as consequences of that functional form. No step equates a claimed prediction or uniqueness result to a fitted parameter or prior self-citation; the conditioning argument is presented as a formal possibility of variable substitution without any reduction to an input equation or self-referential theorem. External citations (axon, risk, RL) are used only for motivation, not as load-bearing justification. The derivation chain therefore remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The proposal rests on the modeling assumption that local joint distributions can be expressed in the given linear combination form and that this enables the claimed capabilities; no independent evidence for practical performance is provided.

free parameters (1)
  • a_j coefficients
    The expansion coefficients in the joint density model ρ(x)=sum a_j f_j(x) would need to be determined during training.
axioms (1)
  • domain assumption: The local joint distribution can be expressed as a linear combination of basis functions f_j(x)
    This is the core modeling assumption stated in the abstract for the neuron definition.
invented entities (1)
  • joint distribution neuron (no independent evidence)
    purpose: To model local joint distributions enabling bidirectional propagation and probabilistic outputs
    New neuron type introduced in the proposal.

pith-pipeline@v0.9.0 · 5814 in / 1407 out tokens · 44428 ms · 2026-05-24T00:54:51.927105+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    Small language models display distinct energy-use and coverage profiles when generating unit tests, with some models being more efficient while others offer higher stability.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Dynamic network plasticity and sample efficiency in biological neural cultures: A comparative study with deep reinforcement learning,

    M. Khajehnejad, F. Habibollahi, A. Loeffler, A. Paul, A. Razi, and B. J. Kagan, “Dynamic network plasticity and sample efficiency in biological neural cultures: A comparative study with deep reinforcement learning,” Cyborg and Bionic Systems, vol. 6, p. 0336, 2025

  2. [2]

    Dynamics of signal propagation and collision in axons,

    R. Follmann, E. Rosa Jr, and W. Stein, “Dynamics of signal propagation and collision in axons,” Physical Review E, vol. 92, no. 3, p. 032707, 2015

  3. [3]

    The concept of uncertainty in animal experiments using aversive stimulation

    H. Imada and Y. Nageishi, “The concept of uncertainty in animal experiments using aversive stimulation,” Psychological Bulletin, vol. 91, no. 3, p. 573, 1982

  4. [4]

    Spiking neural networks and their applications: A review,

    K. Yamazaki, V.-K. Vo-Ho, D. Bulsara, and N. Le, “Spiking neural networks and their applications: A review,” Brain Sciences, vol. 12, no. 7, p. 863, 2022

  5. [5]

    The information bottleneck method

    N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000

  6. [6]

    KAN: Kolmogorov-Arnold Networks

    Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, “KAN: Kolmogorov-Arnold networks,” arXiv preprint arXiv:2404.19756, 2024

  7. [7]

    Multilayer feedforward networks are universal approximators,

    K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989

  8. [8]

    Information bottleneck for gaussian variables,

    G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, “Information bottleneck for gaussian variables,” Advances in Neural Information Processing Systems, vol. 16, 2003

  9. [9]

    Deep learning and the information bottleneck principle,

    N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW). IEEE, 2015, pp. 1–5

  10. [10]

    Hierarchical correlation reconstruction with missing data, for example for biology-inspired neuron

    J. Duda, “Hierarchical correlation reconstruction with missing data, for example for biology-inspired neuron,” arXiv preprint arXiv:1804.06218, 2018

  11. [12]

    Modelling bid-ask spread conditional distributions using hierarchical correlation reconstruction,

    J. Duda, H. Gurgul, and R. Syrek, “Modelling bid-ask spread conditional distributions using hierarchical correlation reconstruction,” Statistics in Transition New Series, vol. 21, no. 5, 2020, preprint: https://arxiv.org/abs/1911.02361

  12. [13]

    Gamma-ray blazar variability: new statistical methods of time-flux distributions,

    J. Duda and G. Bhatta, “Gamma-ray blazar variability: new statistical methods of time-flux distributions,” Monthly Notices of the Royal Astronomical Society, vol. 508, no. 1, pp. 1446–1458, 2021

  13. [14]

    Prediction of probability distributions of molecular properties: towards more efficient virtual screening and better understanding of compound representations,

    J. Duda and S. Podlewska, “Prediction of probability distributions of molecular properties: towards more efficient virtual screening and better understanding of compound representations,” Molecular Diversity, pp. 1–12, 2022

  14. [15]

    Predicting conditional probability distributions of redshifts of active galactic nuclei using hierarchical correlation reconstruction,

    J. Duda and G. Bhatta, “Predicting conditional probability distributions of redshifts of active galactic nuclei using hierarchical correlation reconstruction,” Monthly Notices of the Royal Astronomical Society, p. stae963, 2024

  15. [16]

    Kernelized information bottleneck leads to biologically plausible 3-factor hebbian learning in deep networks,

    R. Pogodin and P. Latham, “Kernelized information bottleneck leads to biologically plausible 3-factor hebbian learning in deep networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 7296–7307, 2020

  16. [17]

    The HSIC bottleneck: Deep learning without back-propagation,

    W.-D. K. Ma, J. Lewis, and W. B. Kleijn, “The HSIC bottleneck: Deep learning without back-propagation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5085–5092

  17. [18]

    Copula theory: an introduction,

    F. Durante and C. Sempi, “Copula theory: an introduction,” in Copula Theory and Its Applications. Springer, 2010, pp. 3–31

  18. [19]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift,

    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456

  19. [20]

    Rapid parametric density estimation

    J. Duda, “Rapid parametric density estimation,” arXiv preprint arXiv:1702.02144, 2017

  20. [21]

    Improving kan with cdf normalization to quantiles,

    J. Strawa and J. Duda, “Improving kan with cdf normalization to quantiles,” arXiv preprint arXiv:2507.13393, 2025

  21. [22]

    Credibility evaluation of income data with hierarchical correlation reconstruction

    J. Duda and A. Szulc, “Social benefits versus monetary and multidimensional poverty in Poland: imputed income exercise,” in International Conference on Applied Economics. Springer, 2019, pp. 87–102, preprint: arXiv:1812.08040

  22. [23]

    Tensor decompositions and applications,

    T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009

  23. [24]

    Fast optimization of common basis for matrix set through common singular value decomposition,

    J. Duda, “Fast optimization of common basis for matrix set through common singular value decomposition,” arXiv preprint arXiv:2204.08242, 2022

  24. [25]

    Linear cost mutual information estimation and independence test of similar performance as hsic,

    J. Duda, J. Bracha, and A. Przybysz, “Linear cost mutual information estimation and independence test of similar performance as hsic,” arXiv preprint arXiv:2508.18338, 2025

  25. [26]

    A kernel statistical test of independence,

    A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola, “A kernel statistical test of independence,” Advances in Neural Information Processing Systems, vol. 20, 2007

  26. [27]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017