Recognition: 2 theorem links
Selectivity and Shape in the Design of Forward-Forward Goodness Functions
Pith reviewed 2026-05-14 21:56 UTC · model grok-4.3
The pith
Goodness functions in Forward-Forward networks work best when they focus on the shape of neural activity rather than its total energy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a goodness function must be sensitive to the shape of neural activity, not its total energy, because deep network activations follow heavy-tailed distributions where discriminative information concentrates in peak activities. Selective functions such as top-k and entmax-weighted energy capture only the strongest units, while shape-sensitive functions based on excess kurtosis and higher moments reward heavy-tailed statistics in a scale-invariant way.
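The paper's implementations are not reproduced on this page, so the following is a minimal sketch of the distinction being drawn, assuming a per-layer activity vector h; the function names and the default k are illustrative, not the authors' code, and the entmax-weighted variant is omitted.

import numpy as np

def sos_goodness(h: np.ndarray) -> float:
    # Baseline: total energy of the layer activity (the SoS choice in FF).
    return float(np.sum(h ** 2))

def topk_goodness(h: np.ndarray, k: int = 10) -> float:
    # Selective: energy of only the k strongest units, so the bulk of
    # near-zero activations cannot dominate the goodness score.
    top = np.sort(np.abs(h))[-k:]
    return float(np.sum(top ** 2))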
What carries the argument
The burstiness statistic based on excess kurtosis, which measures the heavy-tailed character of activity distributions in a scale-invariant manner.
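As a sketch of that statistic, assuming the paper's quoted definition (fourth central moment over squared variance, minus the Gaussian baseline of 3), the scale invariance can be checked numerically; the heavy-tailed sampling distribution below is illustrative only.

import numpy as np

def burstiness(h: np.ndarray) -> float:
    # Excess kurtosis: fourth central moment normalized by squared variance,
    # minus 3 so a Gaussian activity vector scores about zero.
    mu = h.mean()
    m2 = np.mean((h - mu) ** 2)
    m4 = np.mean((h - mu) ** 4)
    return float(m4 / m2 ** 2 - 3.0)

h = np.random.standard_t(df=3, size=2000)           # heavy-tailed activity
print(burstiness(h) > 0)                             # True: heavy tails score high
print(np.isclose(burstiness(h), burstiness(5 * h)))  # True: invariant to rescaling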
If this is right
- Networks reach 98.2 ± 0.1 percent accuracy on MNIST with a four-layer, 2000-unit-per-layer model (4×2000), a gain of +32.6 percentage points over sum-of-squares.
- Comparable gains appear on other benchmarks (+72 percentage points on USPS, +52 on SVHN, 89.0 percent on Fashion-MNIST) and remain stable across five activation functions.
- The scale-invariant character of burstiness keeps performance consistent even when activity magnitudes change between layers.
Where Pith is reading between the lines
- Local learning rules may succeed more broadly once they incorporate selectivity for peaked activity rather than averaging across all units.
- The same shape-sensitive principle could be tested on non-image data where heavy-tailed activations are expected.
- Combining these goodness functions with other local training methods might narrow the remaining performance gap to backpropagation on harder tasks.
Load-bearing premise
Deep network activations follow heavy-tailed distributions and discriminative information is often concentrated in peak activities.
What would settle it
Running the same experiments on a dataset engineered to produce light-tailed Gaussian-like activations and finding no accuracy advantage for shape-sensitive functions over sum-of-squares would falsify the central claim.
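A toy version of that check, assuming synthetic "activations" rather than trained networks, shows why the test is informative: on Gaussian-like vectors the burstiness statistic hovers near zero and offers nothing to discriminate on, while heavy-tailed vectors score high. The distributions below are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(H: np.ndarray) -> np.ndarray:
    # Row-wise excess kurtosis over a batch of activity vectors.
    mu = H.mean(axis=1, keepdims=True)
    m2 = np.mean((H - mu) ** 2, axis=1)
    m4 = np.mean((H - mu) ** 4, axis=1)
    return m4 / m2 ** 2 - 3.0

light = rng.normal(size=(1000, 2000))             # Gaussian-like activations
heavy = rng.standard_t(df=3, size=(1000, 2000))   # heavy-tailed activations
print(excess_kurtosis(light).mean())   # ~0: no shape signal to exploit
print(excess_kurtosis(heavy).mean())   # >> 0: shape carries information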
Original abstract
The Forward-Forward (FF) algorithm trains networks layer-by-layer using a local "goodness function," yet sum-of-squares (SoS) has remained the only choice studied. We systematically explore the goodness-function design space and identify a unifying principle: the goodness function must be sensitive to the shape of neural activity, not its total energy. This principle is motivated by the observation that deep network activations follow heavy-tailed distributions and that discriminative information is often concentrated in peak activities. We propose two complementary families: selective functions (top-k, entmax-weighted energy) that measure only peak activity, and shape-sensitive functions (excess kurtosis / "burstiness" and higher-order moments) that reward heavy-tailed distributions via scale-invariant statistics. Combined with separate label-feature forwarding (FFCL), controlled experiments across 13 goodness functions, 5 activations, 6 datasets, and three continuous sweeps, each tracing a characteristic inverted-U, yield 89.0% on Fashion-MNIST and 98.2±0.1% on MNIST (4×2000), a +32.6pp gain over SoS, with consistent improvements across all benchmarks (+72pp USPS, +52pp SVHN). The scale-invariant nature of burstiness makes it particularly robust to magnitude shifts across layers and datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Forward-Forward goodness functions must be sensitive to the shape of neural activity (via selective top-k/entmax or kurtosis-based burstiness) rather than total energy, motivated by heavy-tailed activations concentrating discriminative information in peaks. It reports large empirical gains (98.2% MNIST, 89% Fashion-MNIST, +32.6pp over SoS) across 13 functions, 6 datasets, and inverted-U sweeps when combined with FFCL.
Significance. If the mechanism is confirmed, this supplies a useful design principle for local goodness functions in layer-wise training, with scale-invariant burstiness offering robustness across layers. The systematic exploration of the design space and consistent sweep patterns are strengths.
major comments (3)
- Abstract and Experiments: the central claim that gains arise specifically from shape sensitivity to heavy-tailed peaks lacks direct support; no per-layer kurtosis statistics, top-k energy fractions, or mutual-information estimates between labels and peak versus bulk activity are measured on the trained networks.
- Results: reported gains (+72pp USPS, +52pp SVHN) and error bars (e.g., 98.2±0.1%) are given without baseline code, error-bar methodology, or statistical tests, preventing verification of the improvements over SoS.
- Discussion: the inverted-U sweeps are consistent with the hypothesis but do not isolate shape sensitivity from confounds such as FFCL forwarding or activation choice; alternative explanations remain viable.
minor comments (2)
- Abstract: the mention of '5 activations' and 'three continuous sweeps' lacks specification of which activations or swept parameters produce the inverted-U traces.
- Methods: implementation details, exact hyperparameter settings, and pseudocode for the 13 goodness functions are absent, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review highlighting areas where additional evidence and verifiability would strengthen the manuscript. We address each major comment below with clarifications based on the existing experiments and propose targeted revisions.
Point-by-point responses
- Referee: Abstract and Experiments: the central claim that gains arise specifically from shape sensitivity to heavy-tailed peaks lacks direct support; no per-layer kurtosis statistics, top-k energy fractions, or mutual-information estimates between labels and peak versus bulk activity are measured on the trained networks.
Authors: We agree that direct post-hoc measurements (per-layer kurtosis, top-k energy fractions, and label mutual information on peak vs. bulk activity) would provide stronger causal support for the shape-sensitivity hypothesis. Our current evidence is indirect but systematic: the largest gains occur precisely for the selective and kurtosis-based families, with consistent inverted-U patterns across all datasets when total-energy baselines are controlled. In revision we will add a new subsection in Experiments reporting these statistics on the trained networks for the top-performing goodness functions, directly addressing the gap. revision: yes
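For concreteness, the two cheapest of the promised diagnostics could look like the following sketch, assuming a flattened activity vector per layer; this is an illustration of the measurements described, not the authors' implementation (scipy's kurtosis with fisher=True already returns the excess form).

import numpy as np
from scipy.stats import kurtosis

def layer_diagnostics(h: np.ndarray, k: int = 10) -> tuple[float, float]:
    # Per-layer excess kurtosis and the fraction of total energy carried
    # by the k strongest units, measured on a trained network's activity.
    excess_k = float(kurtosis(h, fisher=True))
    energy = np.sum(h ** 2)
    topk_energy = np.sum(np.sort(h ** 2)[-k:])
    return excess_k, float(topk_energy / energy)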
- Referee: Results: reported gains (+72pp USPS, +52pp SVHN) and error bars (e.g., 98.2±0.1%) are given without baseline code, error-bar methodology, or statistical tests, preventing verification of the improvements over SoS.
Authors: The reported error bars reflect standard deviation over five independent runs with distinct random seeds; the methodology will be expanded in the revised Experiments section. We will also add paired t-tests confirming statistical significance of the gains over SoS. Full baseline code and reproduction scripts will be released upon acceptance (currently withheld only for anonymity). These changes directly resolve the verifiability concern. revision: yes
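A paired test of the kind the authors describe would be straightforward; a sketch assuming five matched per-seed accuracies for each method (the numbers below are placeholders, not the paper's data):

import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-seed test accuracies for five matched runs.
sos_acc   = np.array([0.651, 0.660, 0.648, 0.655, 0.662])
burst_acc = np.array([0.981, 0.983, 0.982, 0.981, 0.983])

print(burst_acc.mean(), burst_acc.std(ddof=1))   # mean and std over seeds
t_stat, p_value = ttest_rel(burst_acc, sos_acc)  # paired across seeds
print(t_stat, p_value)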
- Referee: Discussion: the inverted-U sweeps are consistent with the hypothesis but do not isolate shape sensitivity from confounds such as FFCL forwarding or activation choice; alternative explanations remain viable.
Authors: The sweeps were performed with FFCL and activation fixed while varying only the goodness function, which isolates the shape vs. energy distinction within each setting. We nevertheless acknowledge that explicit discussion of remaining confounds is warranted. In revision we will expand the Discussion to enumerate these alternatives and add a short ablation table showing performance when FFCL is removed, thereby clarifying the contribution of each component. revision: partial
Circularity Check
No significant circularity; empirical exploration of goodness functions
Full rationale
The paper conducts a controlled empirical study of 13 goodness functions across 5 activations, 6 datasets, and continuous sweeps, reporting performance gains from selective and shape-sensitive functions over sum-of-squares. The unifying principle is presented as a motivation from general observations of heavy-tailed activations rather than a derivation that reduces to fitted inputs or self-citations by construction. No equations, predictions, or uniqueness claims are shown to collapse to the paper's own parameters or prior self-work; results are outcomes of explicit experimental sweeps with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: deep network activations follow heavy-tailed distributions
- domain assumption: discriminative information is often concentrated in peak activities
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "the goodness function must be sensitive to the shape of neural activity, not its total energy... burstiness goodness (excess kurtosis)... scale-invariant statistics"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · refines
REFINES: relation between the paper passage and the cited Recognition theorem.
Passage: g_burst(h) = [(1/d) Σᵢ (hᵢ − μ)⁴] / [(1/d) Σᵢ (hᵢ − μ)²]² − 3, with g_burst(αh) = g_burst(h) for any α > 0.
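The scale invariance quoted in the passage follows from the moment ratio alone; under the definition above, rescaling h by α > 0 multiplies the numerator and the squared-variance denominator by the same α⁴:

g_{\mathrm{burst}}(\alpha h)
= \frac{\frac{1}{d}\sum_i (\alpha h_i - \alpha\mu)^4}
       {\left[\frac{1}{d}\sum_i (\alpha h_i - \alpha\mu)^2\right]^2} - 3
= \frac{\alpha^4 \, \frac{1}{d}\sum_i (h_i - \mu)^4}
       {\alpha^4 \left[\frac{1}{d}\sum_i (h_i - \mu)^2\right]^2} - 3
= g_{\mathrm{burst}}(h).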
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.