Pith · machine review for the scientific record

arxiv: 2604.13081 · v2 · submitted 2026-03-28 · 💻 cs.LG · cs.AI · cs.NE

Recognition: 2 Lean theorem links

Selectivity and Shape in the Design of Forward-Forward Goodness Functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE
keywords forward-forward · goodness function · local learning · heavy-tailed distributions · selectivity · burstiness · neural networks · image classification

The pith

Goodness functions in Forward-Forward networks work best when they focus on the shape of neural activity rather than its total energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The Forward-Forward algorithm trains networks layer by layer with a local goodness function, yet only sum-of-squares had been examined. The paper demonstrates that this function must instead respond to the shape of activity patterns, for example by selecting only the strongest units or by measuring how heavy-tailed the distribution is. The change is motivated by the observation that network activations are heavy-tailed and that useful signals sit in the peaks. Experiments on thirteen variants across six datasets show consistent accuracy gains, including 98.2 percent on MNIST and 89 percent on Fashion-MNIST.

Core claim

The paper establishes that a goodness function must be sensitive to the shape of neural activity, not its total energy, because deep network activations follow heavy-tailed distributions where discriminative information concentrates in peak activities. Selective functions such as top-k and entmax-weighted energy capture only the strongest units, while shape-sensitive functions based on excess kurtosis and higher moments reward heavy-tailed statistics in a scale-invariant way.
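To make the contrast concrete, here is a minimal sketch of the families the claim distinguishes. The function names, the choice of k, the form of the higher-moment statistic, and the omission of the entmax weighting are our illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketches of the goodness families contrasted above; the exact
# normalizations and hyperparameters used in the paper may differ.
import numpy as np

def goodness_sos(h):
    """Baseline sum-of-squares: total energy of the layer's activity vector."""
    return np.sum(h ** 2)

def goodness_topk(h, k=64):
    """Selective: energy of only the k strongest units, ignoring the bulk."""
    return np.sum(np.sort(h ** 2)[-k:])

def goodness_moment(h, p=6, eps=1e-8):
    """Higher-order standardized moment: larger when activity is peaked."""
    z = (h - h.mean()) / (h.std() + eps)
    return np.mean(np.abs(z) ** p)

def goodness_burstiness(h, eps=1e-8):
    """Excess kurtosis ("burstiness"): a scale-invariant measure of how
    heavy-tailed the activity distribution is (zero for a Gaussian pattern)."""
    mu, var = h.mean(), h.var() + eps
    return np.mean((h - mu) ** 4) / var ** 2 - 3.0
```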

What carries the argument

The burstiness statistic based on excess kurtosis, which measures the heavy-tailed character of activity distributions in a scale-invariant manner.
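A quick numerical check of that scale invariance, using a synthetic heavy-tailed activity vector purely for illustration: rescaling the pattern changes the sum-of-squares energy by the square of the factor but leaves excess kurtosis untouched.

```python
# Rescaling an activity pattern changes its total energy but not its shape.
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_t(df=3, size=2000)   # synthetic heavy-tailed activity
for scale in (0.1, 1.0, 10.0):
    x = scale * h
    energy = np.sum(x ** 2)                                    # grows as scale**2
    kurt = np.mean((x - x.mean()) ** 4) / x.var() ** 2 - 3.0   # unchanged
    print(f"scale={scale:5.1f}  energy={energy:14.2f}  excess_kurtosis={kurt:6.2f}")
```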

If this is right

  • Networks reach 98.2 percent accuracy on MNIST with a four-layer 2000-unit model, a gain of more than thirty points over sum-of-squares.
  • Comparable gains appear on USPS, SVHN, Fashion-MNIST and other benchmarks and remain stable across five activation functions.
  • The scale-invariant character of burstiness keeps performance consistent even when activity magnitudes change between layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Local learning rules may succeed more broadly once they incorporate selectivity for peaked activity rather than averaging across all units.
  • The same shape-sensitive principle could be tested on non-image data where heavy-tailed activations are expected.
  • Combining these goodness functions with other local training methods might narrow the remaining performance gap to backpropagation on harder tasks.

Load-bearing premise

Deep network activations follow heavy-tailed distributions and discriminative information is often concentrated in peak activities.

What would settle it

Running the same experiments on a dataset engineered to produce light-tailed, Gaussian-like activations would settle it: if shape-sensitive functions still beat sum-of-squares there, the heavy-tailed-peaks mechanism cannot be what drives the gains, whereas finding no advantage in that regime is exactly what the central claim predicts.
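The premise behind that test can be previewed on synthetic activity: for Gaussian-like patterns the burstiness statistic sits near zero, leaving a shape-sensitive goodness with little to separate, whereas heavy-tailed patterns push it sharply positive. The distributions and sizes below are illustrative only, not the engineered dataset the test would require.

```python
# Excess kurtosis on light-tailed versus heavy-tailed synthetic activity.
import numpy as np

def excess_kurtosis(x):
    return np.mean((x - x.mean()) ** 4) / x.var() ** 2 - 3.0

rng = np.random.default_rng(1)
light = rng.normal(size=(200, 2000))            # Gaussian-like control
heavy = rng.standard_t(df=3, size=(200, 2000))  # heavy-tailed, as the paper assumes

k_light = [excess_kurtosis(r) for r in light]
k_heavy = [excess_kurtosis(r) for r in heavy]
print(f"light-tailed: mean {np.mean(k_light):.2f}")   # near zero
print(f"heavy-tailed: mean {np.mean(k_heavy):.2f}")   # clearly positive
```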

Figures

Figures reproduced from arXiv: 2604.13081 by Hassan Sawaf, Kamer Ali Yuksel, Suayp Talha Kocabay, Talha Ruzgar Akkus.

Figure 1. Three sweep axes on Fashion-MNIST (4×2000). All three trace an inverted U. (a) k-sweep: FFCL robust (<2 pp). (b) α-sweep: peaks at α ≈ 1.5. (c) Moment-p sweep: peaks at p ≈ 5–6 (89.04% FFCL).
read the original abstract

The Forward-Forward (FF) algorithm trains networks layer-by-layer using a local "goodness function," yet sum-of-squares (SoS) has remained the only choice studied. We systematically explore the goodness-function design space and identify a unifying principle: the goodness function must be sensitive to the shape of neural activity, not its total energy. This principle is motivated by the observation that deep network activations follow heavy-tailed distributions and that discriminative information is often concentrated in peak activities. We propose two complementary families: selective functions (top-k, entmax-weighted energy) that measure only peak activity, and shape-sensitive functions (excess kurtosis / "burstiness" and higher-order moments) that reward heavy-tailed distributions via scale-invariant statistics. Combined with separate label-feature forwarding (FFCL), controlled experiments across 13 goodness functions, 5 activations, 6 datasets, and three continuous sweeps, each tracing a characteristic inverted-U, yield 89.0% on Fashion-MNIST and 98.2±0.1% on MNIST (4×2000), a +32.6pp gain over SoS, with consistent improvements across all benchmarks (+72pp USPS, +52pp SVHN). The scale-invariant nature of burstiness makes it particularly robust to magnitude shifts across layers and datasets.
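For orientation, a minimal PyTorch-style sketch of how a pluggable goodness function enters the per-layer Forward-Forward objective. The threshold and logistic loss follow Hinton's original formulation; the specific goodness shown, the hyperparameters, and the omission of FFCL's separate label forwarding and of length normalization between layers are simplifications on our part.

```python
# One Forward-Forward layer with a pluggable goodness function: push goodness
# above a threshold theta for positive (correctly labelled) inputs and below
# it for negative (wrongly labelled) ones. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def burstiness(h, eps=1e-8):
    # Per-sample excess kurtosis of the layer's activity (scale-invariant).
    mu = h.mean(dim=1, keepdim=True)
    var = h.var(dim=1, unbiased=False) + eps
    return ((h - mu) ** 4).mean(dim=1) / var ** 2 - 3.0

class FFLayer(torch.nn.Module):
    def __init__(self, d_in, d_out, goodness=burstiness, theta=2.0, lr=1e-3):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.goodness, self.theta = goodness, theta
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.goodness(self.forward(x_pos))
        g_neg = self.goodness(self.forward(x_neg))
        # Local logistic loss on goodness relative to the threshold.
        loss = (F.softplus(self.theta - g_pos) + F.softplus(g_neg - self.theta)).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach outputs so no gradient flows between layers (training stays local).
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```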

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity check, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Forward-Forward goodness functions must be sensitive to the shape of neural activity (via selective top-k/entmax or kurtosis-based burstiness) rather than total energy, motivated by heavy-tailed activations concentrating discriminative information in peaks. It reports large empirical gains (98.2% MNIST, 89% Fashion-MNIST, +32.6pp over SoS) across 13 functions, 6 datasets, and inverted-U sweeps when combined with FFCL.

Significance. If the mechanism is confirmed, this supplies a useful design principle for local goodness functions in layer-wise training, with scale-invariant burstiness offering robustness across layers. The systematic exploration of the design space and consistent sweep patterns are strengths.

major comments (3)
  1. Abstract and Experiments: the central claim that gains arise specifically from shape sensitivity to heavy-tailed peaks lacks direct support; no per-layer kurtosis, top-k energy fraction, or mutual information between peak vs. bulk activity and labels are measured on the trained networks.
  2. Results: reported gains (+72pp USPS, +52pp SVHN) and error bars (e.g., 98.2±0.1%) are given without baseline code, error-bar methodology, or statistical tests, preventing verification of the improvements over SoS.
  3. Discussion: the inverted-U sweeps are consistent with the hypothesis but do not isolate shape sensitivity from confounds such as FFCL forwarding or activation choice; alternative explanations remain viable.
minor comments (2)
  1. Abstract: the mention of '5 activations' and 'three continuous sweeps' lacks specification of which activations or swept parameters produce the inverted-U traces.
  2. Methods: implementation details, exact hyperparameter settings, and pseudocode for the 13 goodness functions are absent, hindering reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review highlighting areas where additional evidence and verifiability would strengthen the manuscript. We address each major comment below with clarifications based on the existing experiments and propose targeted revisions.

read point-by-point responses
  1. Referee: Abstract and Experiments: the central claim that gains arise specifically from shape sensitivity to heavy-tailed peaks lacks direct support; no per-layer kurtosis, top-k energy fraction, or mutual information between peak vs. bulk activity and labels are measured on the trained networks.

    Authors: We agree that direct post-hoc measurements (per-layer kurtosis, top-k energy fractions, and label mutual information on peak vs. bulk activity) would provide stronger causal support for the shape-sensitivity hypothesis. Our current evidence is indirect but systematic: the largest gains occur precisely for the selective and kurtosis-based families, with consistent inverted-U patterns across all datasets when total-energy baselines are controlled. In revision we will add a new subsection in Experiments reporting these statistics on the trained networks for the top-performing goodness functions, directly addressing the gap (a sketch of such diagnostics follows these responses). revision: yes

  2. Referee: Results: reported gains (+72pp USPS, +52pp SVHN) and error bars (e.g., 98.2±0.1%) are given without baseline code, error-bar methodology, or statistical tests, preventing verification of the improvements over SoS.

    Authors: The reported error bars reflect standard deviation over five independent runs with distinct random seeds; the methodology will be expanded in the revised Experiments section. We will also add paired t-tests confirming statistical significance of the gains over SoS (both computations are sketched after these responses). Full baseline code and reproduction scripts will be released upon acceptance (currently withheld only for anonymity). These changes directly resolve the verifiability concern. revision: yes

  3. Referee: Discussion: the inverted-U sweeps are consistent with the hypothesis but do not isolate shape sensitivity from confounds such as FFCL forwarding or activation choice; alternative explanations remain viable.

    Authors: The sweeps were performed with FFCL and activation fixed while varying only the goodness function, which isolates the shape vs. energy distinction within each setting. We nevertheless acknowledge that explicit discussion of remaining confounds is warranted. In revision we will expand the Discussion to enumerate these alternatives and add a short ablation table showing performance when FFCL is removed, thereby clarifying the contribution of each component. revision: partial
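The diagnostics promised in response 1 and the error-bar and significance reporting promised in response 2 are both straightforward to sketch. The snippets below are hedged illustrations with placeholder numbers and names of our choosing, not the authors' code or results.

```python
# Response 1: per-layer excess kurtosis and top-k energy fraction, computed on
# recorded activations `acts` of shape (n_samples, n_units) for one layer.
# Label mutual information would need a separate estimator and is omitted.
import numpy as np

def per_sample_excess_kurtosis(acts):
    mu = acts.mean(axis=1, keepdims=True)
    var = acts.var(axis=1) + 1e-8
    return ((acts - mu) ** 4).mean(axis=1) / var ** 2 - 3.0

def topk_energy_fraction(acts, k=64):
    energy = acts ** 2
    top = np.sort(energy, axis=1)[:, -k:]
    return top.sum(axis=1) / (energy.sum(axis=1) + 1e-8)
# Heavy-tailed layers should show clearly positive kurtosis and a top-k
# fraction well above k / n_units.
```

For response 2, mean and standard deviation over matched seeds plus a paired t-test against the SoS baseline; the accuracy arrays here are placeholders, not the paper's results.

```python
import numpy as np
from scipy import stats

acc_burstiness = np.array([0.980, 0.982, 0.981, 0.983, 0.982])  # placeholder
acc_sos        = np.array([0.655, 0.660, 0.648, 0.652, 0.657])  # placeholder

print(f"burstiness: {acc_burstiness.mean():.3f} ± {acc_burstiness.std(ddof=1):.3f}")
print(f"SoS:        {acc_sos.mean():.3f} ± {acc_sos.std(ddof=1):.3f}")
t, p = stats.ttest_rel(acc_burstiness, acc_sos)  # paired across seeds
print(f"paired t-test: t = {t:.2f}, p = {p:.2g}")
```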

Circularity Check

0 steps flagged

No significant circularity; the paper is a controlled empirical exploration of goodness functions.

full rationale

The paper conducts a controlled empirical study of 13 goodness functions across 5 activations, 6 datasets, and continuous sweeps, reporting performance gains from selective and shape-sensitive functions over sum-of-squares. The unifying principle is presented as a motivation from general observations of heavy-tailed activations rather than a derivation that reduces to fitted inputs or self-citations by construction. No equations, predictions, or uniqueness claims are shown to collapse to the paper's own parameters or prior self-work; results are outcomes of explicit experimental sweeps with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about activation statistics plus an experimental protocol the abstract does not report; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Deep network activations follow heavy-tailed distributions
    Stated motivation for shape sensitivity in the abstract
  • domain assumption Discriminative information is often concentrated in peak activities
    Stated motivation for selective functions in the abstract

pith-pipeline@v0.9.0 · 5550 in / 1227 out tokens · 53912 ms · 2026-05-14T21:56:33.547800+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    F Results on 2×500 Architecture Table 10 presents results on the smaller 2×500 architecture (standard FF only; FFCL experiments were not run at this scale). The 2×500 results confirm that the top- k advantage holds at smaller scale. On Fashion-MNIST, Swish + top-k achieves 76.65% at 2×500—which exceeds the 4 ×2000 baseline (56.41%) by +20.2pp. This meansa...