Hyperspherical Forward-Forward with Prototypical Representations
Pith reviewed 2026-05-09 20:29 UTC · model grok-4.3
The pith
Reframing each layer's local objective as multi-class classification on class-specific unit-norm prototypes in hyperspherical space allows Forward-Forward networks to train and infer in a single forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By embedding class-specific unit-norm prototypes in hyperspherical feature space, the local objective at each layer is turned into a multi-class classification problem whose decision boundaries are defined by the angular distances to those prototypes. This single change removes the need for repeated forward passes during inference and permits weight updates and class decisions to occur within the same forward computation, all while preserving the layer-wise independence that distinguishes Forward-Forward from backpropagation.
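A minimal sketch of a layer-local objective of this kind, in PyTorch, assuming cosine-similarity logits scaled by an inverse temperature `tau` (the paper's exact loss form and hyperparameters are not reproduced here, so both are assumptions):

```python
import torch
import torch.nn.functional as F

def local_hyperspherical_loss(activations, prototypes, labels, tau=10.0):
    """Layer-local multi-class objective on the unit hypersphere.

    activations: (batch, d) outputs of the current layer only.
    prototypes:  (num_classes, d) learned class anchors, kept unit-norm.
    labels:      (batch,) integer class labels.
    tau:         assumed inverse-temperature scale (not from the paper).
    """
    z = F.normalize(activations, dim=-1)   # project activations onto the hypersphere
    p = F.normalize(prototypes, dim=-1)    # re-normalize the anchors
    logits = tau * (z @ p.T)               # cosine similarity to every prototype
    # Cross-entropy over all classes: the label's prototype is the positive,
    # and every other prototype acts as an implicit negative.
    return F.cross_entropy(logits, labels)
```

Because the loss reads only this layer's activations, its own prototypes, and the labels, gradients never cross layer boundaries, which is what preserves the layer-wise independence claimed above.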
What carries the argument
The set of class-specific unit-norm prototypes that act as geometric anchors and implicit negatives, converting the per-layer binary goodness signal into a multi-class angular classification task.
If this is right
- Inference requires only one forward pass regardless of the number of classes, delivering a speedup of more than 40x over the original Forward-Forward algorithm (see the sketch after this list).
- The method scales to modern convolutional networks and achieves over 25 percent top-1 accuracy on ImageNet-1k using purely local learning.
- Transfer learning with the same architecture reaches 65.96 percent accuracy on the same benchmark.
- Accuracy on standard image-classification tasks exceeds that of prior greedy local-learning methods while retaining their training advantages.
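To see where the single-pass claim in the first bullet comes from, it helps to line up the two inference procedures. A schematic comparison, assuming the label-overlay input encoding from the original FF paper; `net` stands for any stack of locally trained layers, and the helper names are illustrative, not the paper's API:

```python
import torch
import torch.nn.functional as F

def embed_label(x, c, num_classes):
    # FF-style label overlay: write a one-hot label for class c into the
    # first num_classes input dimensions (the original FF encoding scheme).
    x = x.clone()
    x[..., :num_classes] = F.one_hot(torch.tensor(c), num_classes).float()
    return x

def ff_infer(x, net, num_classes):
    # Original FF: one full forward pass per candidate class; choose the
    # class whose pass yields the highest goodness (summed squared activity).
    scores = [(net(embed_label(x, c, num_classes)) ** 2).sum(dim=-1)
              for c in range(num_classes)]
    return torch.stack(scores).argmax(dim=0)

def hff_infer(x, net, prototypes):
    # HFF: a single forward pass; the decision is the prototype with the
    # largest cosine similarity to the final hyperspherical representation.
    z = F.normalize(net(x), dim=-1)
    p = F.normalize(prototypes, dim=-1)
    return (z @ p.T).argmax(dim=-1)
```

The FF loop runs once per class (1,000 passes per image on ImageNet-1k) while the HFF path is constant in the class count, which is the structural source of the reported speedup.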
Where Pith is reading between the lines
- The hyperspherical prototype construction could be ported to other local-learning schemes such as predictive coding or equilibrium propagation to obtain similar inference speed-ups.
- Because each layer's output is already aligned to class prototypes, intermediate representations may become more directly interpretable without additional post-hoc analysis.
- Removing the multi-pass inference bottleneck makes it feasible to run large Forward-Forward models on edge devices where repeated forward evaluations were previously prohibitive.
- The same geometric anchoring idea might extend to regression or structured-prediction tasks by replacing class prototypes with task-specific unit vectors.
Load-bearing premise
Learned unit-norm prototypes per class can reliably separate inputs locally at every layer without any global coordination that would break the layer-wise training property.
What would settle it
A controlled experiment that trains the same convolutional architecture with HFF on ImageNet-1k and finds top-1 accuracy below 20 percent, together with inference latency no better than the original Forward-Forward baseline, would falsify the central performance claims.
Original abstract
The Forward-Forward (FF) algorithm presents a compelling, bio-inspired alternative to backpropagation. However, while efficient in training, it has a computationally prohibitive inference process that requires a separate forward pass for every class that is evaluated. In this work, we introduce the Hyperspherical Forward-Forward (HFF), a novel reformulation that resolves this critical bottleneck. Our core innovation is to reframe the local objective of each layer from a binary goodness-of-fit task to a direct multi-class classification problem within a hyperspherical feature space. We achieve this by learning a set of class-specific, unit-norm prototypes that act as geometric anchors and implicit negatives. This architectural innovation preserves the benefits of local training while enabling weight update and inference in a single forward pass, making it >40x faster than the original FF algorithm. Our method is simple to implement, scales effectively to modern convolutional architectures, and achieves superior accuracy on standard image classification benchmarks, closing the gap with backpropagation. Most notably, we are among the first greedy local-learning methods to report over 25% top-1 accuracy on ImageNet-1k, and 65.96% with transfer learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hyperspherical Forward-Forward (HFF), a reformulation of the Forward-Forward algorithm that replaces the binary goodness-of-fit objective with a multi-class classification loss in a hyperspherical feature space. Class-specific unit-norm prototypes serve as geometric anchors and implicit negatives, enabling both training and inference in a single forward pass per layer while preserving layer-wise locality. The method reports >40x speedup over original FF, competitive accuracies on standard benchmarks, and notably >25% top-1 accuracy on ImageNet-1k (65.96% with transfer learning).
Significance. If the locality of prototype updates and the empirical gains are confirmed, this would represent a meaningful advance for greedy local-learning methods, demonstrating that they can scale to ImageNet-scale tasks with practical efficiency. The hyperspherical prototype approach offers a clean way to embed class structure directly into local objectives without backpropagation.
major comments (3)
- §3 (Method, local objective definition): The central claim that prototypes act as reliable implicit negatives while keeping the objective strictly local requires an explicit update rule. The text describes prototypes as learned class-specific unit-norm vectors but does not show whether their optimization uses only the current layer's activations and the input label or whether it incorporates any inter-layer statistics or shared initialization that would violate the single-forward-pass property.
- Experimental results (ImageNet-1k table): The reported >25% top-1 accuracy is presented without error bars, number of runs, or ablations on hyperspherical dimension and prototype count. This makes it impossible to determine whether the result is robust or sensitive to post-hoc choices, directly affecting the claim of closing the gap to backpropagation for greedy local methods.
- §4 (Inference and speedup): The >40x speedup is attributed to single-pass inference, but the manuscript does not report wall-clock timings or FLOPs on identical hardware and architectures for both HFF and the original FF baseline. Without these controls, the practical efficiency gain cannot be separated from implementation details.
minor comments (2)
- Notation for the hyperspherical projection and cosine similarity in the loss should be defined once in a preliminary section rather than reintroduced inline.
- The abstract states 'among the first' greedy methods on ImageNet; a brief related-work paragraph citing the closest prior local-learning ImageNet results would strengthen this positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating revisions made to improve the presentation and rigor of the work.
Point-by-point responses
Referee: §3 (Method, local objective definition): The central claim that prototypes act as reliable implicit negatives while keeping the objective strictly local requires an explicit update rule. The text describes prototypes as learned class-specific unit-norm vectors but does not show whether their optimization uses only the current layer's activations and the input label or whether it incorporates any inter-layer statistics or shared initialization that would violate the single-forward-pass property.
Authors: We agree that an explicit update rule is essential for verifying locality. The original manuscript described the role of prototypes but did not include the precise optimization equation. In the revised Section 3, we now provide the update rule, which operates exclusively on the current layer's activations and the input batch labels. No inter-layer statistics or shared parameters across layers are used, confirming that the single-forward-pass property is preserved. revision: yes
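For concreteness, a strictly local prototype step of the kind this response describes could look like the sketch below, assuming plain gradient descent on the layer loss followed by re-projection onto the unit sphere; the step size, temperature, and projection choice are assumptions rather than the paper's stated rule:

```python
import torch
import torch.nn.functional as F

def local_prototype_step(prototypes, activations, labels, lr=0.01, tau=10.0):
    """One locality-preserving update of a single layer's prototypes.

    Reads only this layer's activations and the batch labels; no statistics
    or parameters from any other layer enter the computation.
    """
    prototypes = prototypes.detach().requires_grad_(True)
    z = F.normalize(activations.detach(), dim=-1)
    logits = tau * (z @ F.normalize(prototypes, dim=-1).T)
    loss = F.cross_entropy(logits, labels)
    (grad,) = torch.autograd.grad(loss, prototypes)
    with torch.no_grad():
        updated = prototypes - lr * grad
        return F.normalize(updated, dim=-1)  # re-project onto the hypersphere
```

Any rule of this shape would satisfy the locality check the referee asks for, since every quantity it touches is available within the layer's own forward computation.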
Referee: Experimental results (ImageNet-1k table): The reported >25% top-1 accuracy is presented without error bars, number of runs, or ablations on hyperspherical dimension and prototype count. This makes it impossible to determine whether the result is robust or sensitive to post-hoc choices, directly affecting the claim of closing the gap to backpropagation for greedy local methods.
Authors: We acknowledge the value of statistical reporting and ablations. ImageNet-1k experiments were limited to single runs due to computational demands. In the revision, we have added a note on this limitation in the experimental section, included ablations on hyperspherical dimension and prototype count from smaller datasets in the appendix (showing stable performance), and tempered the claim language regarding the gap to backpropagation to reflect the available evidence. revision: partial
Referee: §4 (Inference and speedup): The >40x speedup is attributed to single-pass inference, but the manuscript does not report wall-clock timings or FLOPs on identical hardware and architectures for both HFF and the original FF baseline. Without these controls, the practical efficiency gain cannot be separated from implementation details.
Authors: The speedup derives from reducing inference from one forward pass per class to a single pass. We have added an asymptotic FLOPs comparison in the revised Section 4 for the same architectures. We did not conduct new wall-clock timings on identical hardware for this revision, as original FF baselines can vary by implementation; a note has been added acknowledging this while emphasizing the inherent reduction in passes. revision: partial
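The asymptotic comparison the authors point to reduces to simple pass counting. A back-of-the-envelope sketch, where `flops_net` stands in for the cost of one forward pass and the prototype similarities are approximated as C x d multiply-accumulates (both illustrative assumptions):

```python
def inference_cost(flops_net, num_classes, feat_dim):
    ff = num_classes * flops_net              # original FF: one pass per class
    hff = flops_net + num_classes * feat_dim  # HFF: one pass plus C x d similarities
    return ff, hff, ff / hff

# Example: a 1 GFLOP network on ImageNet-1k with 512-dimensional features.
ff, hff, ratio = inference_cost(1e9, 1000, 512)
print(f"FF ~{ff:.2e} FLOPs, HFF ~{hff:.2e} FLOPs, ratio ~{ratio:.0f}x")
```

The idealized ratio approaches the class count; the measured >40x sits well below it, which is consistent with the implementation-dependent effects the response acknowledges.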
Circularity Check
No circularity: algorithmic redesign with empirical validation
full rationale
The paper introduces HFF by reframing each layer's local objective as multi-class classification over learned unit-norm prototypes in hyperspherical space, enabling single-pass inference. No equations, predictions, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Performance claims (e.g., >25% ImageNet-1k top-1, >40x speedup) rest on reported benchmarks rather than any self-referential derivation. The method is a self-contained architectural change to the original FF algorithm.
Axiom & Free-Parameter Ledger
free parameters (1)
- class prototypes
invented entities (1)
- hyperspherical feature space with prototypes (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
  Cumulative-goodness Forward-Forward networks exhibit layer free-riding where discrimination gradients decay exponentially with prior positive margins; per-block, hardness-gated, and depth-scaled remedies yield 4-45x b...