Pith · machine review for the scientific record

arxiv: 2604.13081 · v2 · submitted 2026-03-28 · 💻 cs.LG · cs.AI · cs.NE

Recognition: 2 Lean theorem links

Selectivity and Shape in the Design of Forward-Forward Goodness Functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE
keywords forward-forward · goodness function · local learning · heavy-tailed distributions · selectivity · burstiness · neural networks · image classification

The pith

Goodness functions in Forward-Forward networks work best when they focus on the shape of neural activity rather than its total energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The Forward-Forward algorithm trains networks layer by layer with a local goodness function, yet only sum-of-squares had been examined. The paper demonstrates that this function must instead respond to the shape of activity patterns, for example by selecting only the strongest units or by measuring how heavy-tailed the distribution is. The change is motivated by the observation that network activations are heavy-tailed and that useful signals sit in the peaks. Experiments on thirteen variants across six datasets show consistent accuracy gains, including 98.2 percent on MNIST and 89 percent on Fashion-MNIST.

Core claim

The paper establishes that a goodness function must be sensitive to the shape of neural activity, not its total energy, because deep network activations follow heavy-tailed distributions where discriminative information concentrates in peak activities. Selective functions such as top-k and entmax-weighted energy capture only the strongest units, while shape-sensitive functions based on excess kurtosis and higher moments reward heavy-tailed statistics in a scale-invariant way.
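To make the contrast concrete, here is a minimal sketch of the families the claim distinguishes. The function names, the choice of k, the form of the higher-moment statistic, and the omission of the entmax weighting are our illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketches of the goodness families contrasted above; the exact
# normalizations and hyperparameters used in the paper may differ.
import numpy as np

def goodness_sos(h):
    """Baseline sum-of-squares: total energy of the layer's activity vector."""
    return np.sum(h ** 2)

def goodness_topk(h, k=64):
    """Selective: energy of only the k strongest units, ignoring the bulk."""
    return np.sum(np.sort(h ** 2)[-k:])

def goodness_moment(h, p=6, eps=1e-8):
    """Higher-order standardized moment: larger when activity is peaked."""
    z = (h - h.mean()) / (h.std() + eps)
    return np.mean(np.abs(z) ** p)

def goodness_burstiness(h, eps=1e-8):
    """Excess kurtosis ("burstiness"): a scale-invariant measure of how
    heavy-tailed the activity distribution is (zero for a Gaussian pattern)."""
    mu, var = h.mean(), h.var() + eps
    return np.mean((h - mu) ** 4) / var ** 2 - 3.0
```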

What carries the argument

The burstiness statistic based on excess kurtosis, which measures the heavy-tailed character of activity distributions in a scale-invariant manner.
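A quick numerical check of that scale invariance, using a synthetic heavy-tailed activity vector purely for illustration: rescaling the pattern changes the sum-of-squares energy by the square of the factor but leaves excess kurtosis untouched.

```python
# Rescaling an activity pattern changes its total energy but not its shape.
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_t(df=3, size=2000)   # synthetic heavy-tailed activity
for scale in (0.1, 1.0, 10.0):
    x = scale * h
    energy = np.sum(x ** 2)                                    # grows as scale**2
    kurt = np.mean((x - x.mean()) ** 4) / x.var() ** 2 - 3.0   # unchanged
    print(f"scale={scale:5.1f}  energy={energy:14.2f}  excess_kurtosis={kurt:6.2f}")
```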

If this is right

  • Networks reach 98.2 percent accuracy on MNIST with a four-layer 2000-unit model, a gain of more than thirty points over sum-of-squares.
  • Comparable gains appear on USPS, SVHN, Fashion-MNIST and other benchmarks and remain stable across five activation functions.
  • The scale-invariant character of burstiness keeps performance consistent even when activity magnitudes change between layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Local learning rules may succeed more broadly once they incorporate selectivity for peaked activity rather than averaging across all units.
  • The same shape-sensitive principle could be tested on non-image data where heavy-tailed activations are expected.
  • Combining these goodness functions with other local training methods might narrow the remaining performance gap to backpropagation on harder tasks.

Load-bearing premise

Deep network activations follow heavy-tailed distributions and discriminative information is often concentrated in peak activities.

What would settle it

Running the same experiments on a dataset engineered to produce light-tailed, Gaussian-like activations would settle it: if shape-sensitive functions still beat sum-of-squares there, the heavy-tailed-peaks mechanism cannot be what drives the gains, whereas finding no advantage in that regime is exactly what the central claim predicts.
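The premise behind that test can be previewed on synthetic activity: for Gaussian-like patterns the burstiness statistic sits near zero, leaving a shape-sensitive goodness with little to separate, whereas heavy-tailed patterns push it sharply positive. The distributions and sizes below are illustrative only, not the engineered dataset the test would require.

```python
# Excess kurtosis on light-tailed versus heavy-tailed synthetic activity.
import numpy as np

def excess_kurtosis(x):
    return np.mean((x - x.mean()) ** 4) / x.var() ** 2 - 3.0

rng = np.random.default_rng(1)
light = rng.normal(size=(200, 2000))            # Gaussian-like control
heavy = rng.standard_t(df=3, size=(200, 2000))  # heavy-tailed, as the paper assumes

k_light = [excess_kurtosis(r) for r in light]
k_heavy = [excess_kurtosis(r) for r in heavy]
print(f"light-tailed: mean {np.mean(k_light):.2f}")   # near zero
print(f"heavy-tailed: mean {np.mean(k_heavy):.2f}")   # clearly positive
```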

Figures

Figures reproduced from arXiv: 2604.13081 by Hassan Sawaf, Kamer Ali Yuksel, Suayp Talha Kocabay, Talha Ruzgar Akkus.

Figure 1. Three sweep axes on Fashion-MNIST (4×2000). All three trace an inverted U. (a) k-sweep: FFCL robust (<2 pp). (b) α-sweep: peaks at α ≈ 1.5. (c) Moment-p sweep: peaks at p ≈ 5–6 (89.04% FFCL).
read the original abstract

The Forward-Forward (FF) algorithm trains networks layer-by-layer using a local "goodness function," yet sum-of-squares (SoS) has remained the only choice studied. We systematically explore the goodness-function design space and identify a unifying principle: the goodness function must be sensitive to the shape of neural activity, not its total energy. This principle is motivated by the observation that deep network activations follow heavy-tailed distributions and that discriminative information is often concentrated in peak activities. We propose two complementary families: selective functions (top-k, entmax-weighted energy) that measure only peak activity, and shape-sensitive functions (excess kurtosis / "burstiness" and higher-order moments) that reward heavy-tailed distributions via scale-invariant statistics. Combined with separate label-feature forwarding (FFCL), controlled experiments across 13 goodness functions, 5 activations, 6 datasets, and three continuous sweeps, each tracing a characteristic inverted-U, yield 89.0% on Fashion-MNIST and 98.2±0.1% on MNIST (4×2000), a +32.6pp gain over SoS, with consistent improvements across all benchmarks (+72pp USPS, +52pp SVHN). The scale-invariant nature of burstiness makes it particularly robust to magnitude shifts across layers and datasets.
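For orientation, a minimal PyTorch-style sketch of how a pluggable goodness function enters the per-layer Forward-Forward objective. The threshold and logistic loss follow Hinton's original formulation; the specific goodness shown, the hyperparameters, and the omission of FFCL's separate label forwarding and of length normalization between layers are simplifications on our part.

```python
# One Forward-Forward layer with a pluggable goodness function: push goodness
# above a threshold theta for positive (correctly labelled) inputs and below
# it for negative (wrongly labelled) ones. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def burstiness(h, eps=1e-8):
    # Per-sample excess kurtosis of the layer's activity (scale-invariant).
    mu = h.mean(dim=1, keepdim=True)
    var = h.var(dim=1, unbiased=False) + eps
    return ((h - mu) ** 4).mean(dim=1) / var ** 2 - 3.0

class FFLayer(torch.nn.Module):
    def __init__(self, d_in, d_out, goodness=burstiness, theta=2.0, lr=1e-3):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.goodness, self.theta = goodness, theta
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.goodness(self.forward(x_pos))
        g_neg = self.goodness(self.forward(x_neg))
        # Local logistic loss on goodness relative to the threshold.
        loss = (F.softplus(self.theta - g_pos) + F.softplus(g_neg - self.theta)).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach outputs so no gradient flows between layers (training stays local).
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```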

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity check, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Forward-Forward goodness functions must be sensitive to the shape of neural activity (via selective top-k/entmax or kurtosis-based burstiness) rather than total energy, motivated by heavy-tailed activations concentrating discriminative information in peaks. It reports large empirical gains (98.2% MNIST, 89% Fashion-MNIST, +32.6pp over SoS) across 13 functions, 6 datasets, and inverted-U sweeps when combined with FFCL.

Significance. If the mechanism is confirmed, this supplies a useful design principle for local goodness functions in layer-wise training, with scale-invariant burstiness offering robustness across layers. The systematic exploration of the design space and consistent sweep patterns are strengths.

major comments (3)
  1. Abstract and Experiments: the central claim that gains arise specifically from shape sensitivity to heavy-tailed peaks lacks direct support; no per-layer kurtosis, top-k energy fraction, or mutual information between peak vs. bulk activity and labels are measured on the trained networks.
  2. Results: reported gains (+72pp USPS, +52pp SVHN) and error bars (e.g., 98.2±0.1%) are given without baseline code, error-bar methodology, or statistical tests, preventing verification of the improvements over SoS.
  3. Discussion: the inverted-U sweeps are consistent with the hypothesis but do not isolate shape sensitivity from confounds such as FFCL forwarding or activation choice; alternative explanations remain viable.
minor comments (2)
  1. Abstract: the mention of '5 activations' and 'three continuous sweeps' lacks specification of which activations or swept parameters produce the inverted-U traces.
  2. Methods: implementation details, exact hyperparameter settings, and pseudocode for the 13 goodness functions are absent, hindering reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review highlighting areas where additional evidence and verifiability would strengthen the manuscript. We address each major comment below with clarifications based on the existing experiments and propose targeted revisions.

read point-by-point responses
  1. Referee: Abstract and Experiments: the central claim that gains arise specifically from shape sensitivity to heavy-tailed peaks lacks direct support; no per-layer kurtosis, top-k energy fraction, or mutual information between peak vs. bulk activity and labels are measured on the trained networks.

    Authors: We agree that direct post-hoc measurements (per-layer kurtosis, top-k energy fractions, and label mutual information on peak vs. bulk activity) would provide stronger causal support for the shape-sensitivity hypothesis. Our current evidence is indirect but systematic: the largest gains occur precisely for the selective and kurtosis-based families, with consistent inverted-U patterns across all datasets when total-energy baselines are controlled. In revision we will add a new subsection in Experiments reporting these statistics on the trained networks for the top-performing goodness functions, directly addressing the gap (a sketch of such diagnostics follows these responses). revision: yes

  2. Referee: Results: reported gains (+72pp USPS, +52pp SVHN) and error bars (e.g., 98.2±0.1%) are given without baseline code, error-bar methodology, or statistical tests, preventing verification of the improvements over SoS.

    Authors: The reported error bars reflect standard deviation over five independent runs with distinct random seeds; the methodology will be expanded in the revised Experiments section. We will also add paired t-tests confirming statistical significance of the gains over SoS (both computations are sketched after these responses). Full baseline code and reproduction scripts will be released upon acceptance (currently withheld only for anonymity). These changes directly resolve the verifiability concern. revision: yes

  3. Referee: Discussion: the inverted-U sweeps are consistent with the hypothesis but do not isolate shape sensitivity from confounds such as FFCL forwarding or activation choice; alternative explanations remain viable.

    Authors: The sweeps were performed with FFCL and activation fixed while varying only the goodness function, which isolates the shape vs. energy distinction within each setting. We nevertheless acknowledge that explicit discussion of remaining confounds is warranted. In revision we will expand the Discussion to enumerate these alternatives and add a short ablation table showing performance when FFCL is removed, thereby clarifying the contribution of each component. revision: partial
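The diagnostics promised in response 1 and the error-bar and significance reporting promised in response 2 are both straightforward to sketch. The snippets below are hedged illustrations with placeholder numbers and names of our choosing, not the authors' code or results.

```python
# Response 1: per-layer excess kurtosis and top-k energy fraction, computed on
# recorded activations `acts` of shape (n_samples, n_units) for one layer.
# Label mutual information would need a separate estimator and is omitted.
import numpy as np

def per_sample_excess_kurtosis(acts):
    mu = acts.mean(axis=1, keepdims=True)
    var = acts.var(axis=1) + 1e-8
    return ((acts - mu) ** 4).mean(axis=1) / var ** 2 - 3.0

def topk_energy_fraction(acts, k=64):
    energy = acts ** 2
    top = np.sort(energy, axis=1)[:, -k:]
    return top.sum(axis=1) / (energy.sum(axis=1) + 1e-8)
# Heavy-tailed layers should show clearly positive kurtosis and a top-k
# fraction well above k / n_units.
```

For response 2, mean and standard deviation over matched seeds plus a paired t-test against the SoS baseline; the accuracy arrays here are placeholders, not the paper's results.

```python
import numpy as np
from scipy import stats

acc_burstiness = np.array([0.980, 0.982, 0.981, 0.983, 0.982])  # placeholder
acc_sos        = np.array([0.655, 0.660, 0.648, 0.652, 0.657])  # placeholder

print(f"burstiness: {acc_burstiness.mean():.3f} ± {acc_burstiness.std(ddof=1):.3f}")
print(f"SoS:        {acc_sos.mean():.3f} ± {acc_sos.std(ddof=1):.3f}")
t, p = stats.ttest_rel(acc_burstiness, acc_sos)  # paired across seeds
print(f"paired t-test: t = {t:.2f}, p = {p:.2g}")
```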

Circularity Check

0 steps flagged

No significant circularity; the paper is a controlled empirical exploration of goodness functions.

full rationale

The paper conducts a controlled empirical study of 13 goodness functions across 5 activations, 6 datasets, and continuous sweeps, reporting performance gains from selective and shape-sensitive functions over sum-of-squares. The unifying principle is presented as a motivation from general observations of heavy-tailed activations rather than a derivation that reduces to fitted inputs or self-citations by construction. No equations, predictions, or uniqueness claims are shown to collapse to the paper's own parameters or prior self-work; results are outcomes of explicit experimental sweeps with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about activation statistics plus an experimental protocol the abstract does not report; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Deep network activations follow heavy-tailed distributions
    Stated motivation for shape sensitivity in the abstract
  • domain assumption Discriminative information is often concentrated in peak activities
    Stated motivation for selective functions in the abstract

pith-pipeline@v0.9.0 · 5550 in / 1227 out tokens · 53912 ms · 2026-05-14T21:56:33.547800+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    F Results on 2×500 Architecture Table 10 presents results on the smaller 2×500 architecture (standard FF only; FFCL experiments were not run at this scale). The 2×500 results confirm that the top- k advantage holds at smaller scale. On Fashion-MNIST, Swish + top-k achieves 76.65% at 2×500—which exceeds the 4 ×2000 baseline (56.41%) by +20.2pp. This meansa...