arxiv: 2501.09238 · v2 · submitted 2025-01-16 · 💻 cs.LG

Mono-Forward: Revisiting Forward-Forward through Objective-Locality Decomposition

Pith reviewed 2026-05-23 05:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords: backpropagation, objective, goodness, locality, algorithm, decomposition, forward-forward, layer

The pith

Forward-Forward's accuracy gap stems from its goodness objective more than locality, as a local cross-entropy replacement closes much of the gap to backpropagation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper separates Forward-Forward into its local forward passes and its contrastive goodness objective to identify which element limits accuracy. Analysis indicates the goodness objective contributes to the shortfall beyond what locality alone would cause. Mono-Forward is introduced as a controlled alternative that retains locality but trains each layer with a standard multi-class cross-entropy loss. Experiments across MLPs, CNNs, and MLP-Mixers show the new method outperforming standard Forward-Forward while matching or exceeding backpropagation on PathMNIST at substantially lower memory cost.

Core claim

The central claim is that Forward-Forward's performance limitations arise not only from locality but also from its positive-negative goodness objective; Mono-Forward, by applying a standard classification objective locally at each layer, preserves locality while delivering stronger results than Forward-Forward and competitive or superior results to backpropagation with reduced memory.
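
The mechanism behind the core claim can be made concrete with a small sketch of layer-local training. This is an illustration under stated assumptions, not the authors' implementation: `LocalLayer`, `local_step`, and `train_step` are hypothetical names, the per-layer linear head follows the description given in the simulated rebuttal on this page, and plain per-sample SGD stands in for whatever optimizer the paper actually uses. Only the Python standard library is required.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def affine(M, x, b):
    """Compute M @ x + b for list-based matrices and vectors."""
    return [sum(m * xi for m, xi in zip(row, x)) + bi for row, bi in zip(M, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class LocalLayer:
    """Hidden ReLU layer (W, b) with its own linear class head (V, c). Hypothetical layout."""

    def __init__(self, n_in, n_out, n_classes, rng):
        self.W, self.b = rand_matrix(n_out, n_in, rng), [0.0] * n_out
        self.V, self.c = rand_matrix(n_classes, n_out, rng), [0.0] * n_classes

    def local_step(self, x, label, lr):
        """One SGD step on this layer's own cross-entropy. Returns (loss, activation)."""
        h = [max(0.0, v) for v in affine(self.W, x, self.b)]
        p = softmax(affine(self.V, h, self.c))
        loss = -math.log(p[label])
        # d(loss)/d(logits) for softmax cross-entropy is p - onehot(label).
        d_logits = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
        # Gradient inside this layer only (head, then ReLU); nothing reaches earlier layers.
        d_h = []
        for j, hj in enumerate(h):
            grad = sum(dk * self.V[k][j] for k, dk in enumerate(d_logits))
            d_h.append(grad if hj > 0 else 0.0)
        for k, dk in enumerate(d_logits):
            self.c[k] -= lr * dk
            for j, hj in enumerate(h):
                self.V[k][j] -= lr * dk * hj
        for j, dj in enumerate(d_h):
            self.b[j] -= lr * dj
            for i, xi in enumerate(x):
                self.W[j][i] -= lr * dj * xi
        return loss, h

def train_step(layers, x, label, lr=0.1):
    """Update each layer from its own loss; the activation is passed on as a constant."""
    losses = []
    for layer in layers:
        loss, x = layer.local_step(x, label, lr)
        losses.append(loss)
    return losses

rng = random.Random(0)
layers = [LocalLayer(2, 8, 2, rng), LocalLayer(8, 8, 2, rng)]
print(train_step(layers, [0.0, 1.0], label=0))  # one local loss per layer
```

Each layer is updated as soon as its own forward computation finishes, so no activation from an earlier layer has to be kept for a later global backward pass. That property is what the memory claims on this page rely on.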

What carries the argument

The objective-locality decomposition that isolates the contribution of the goodness function from layer-wise forward computation, motivating replacement by local cross-entropy.
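
To show what the decomposition swaps out, the sketch below places the two per-layer objectives side by side. It is a hedged illustration: the function names are hypothetical, and the sum-of-squares goodness with a logistic threshold follows the formulation usually attributed to Hinton's original Forward-Forward paper, which may differ from the exact baseline used here. The point to notice is the call signature: the goodness loss needs activations from a positive and a negative pass, while the cross-entropy loss needs class scores from a single pass.

```python
import math

def goodness(h):
    """Sum of squared activations of one layer."""
    return sum(v * v for v in h)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def ff_goodness_loss(h_pos, h_neg, theta):
    """Contrastive objective: goodness above theta for positive data, below it for negative."""
    return softplus(theta - goodness(h_pos)) + softplus(goodness(h_neg) - theta)

def local_cross_entropy(logits, label):
    """Multi-class cross-entropy on one layer's class scores (single pass, no negatives)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

h = [1.0, 1.0]                                   # goodness(h) == 2.0
print(ff_goodness_loss(h, h, theta=2.0))         # equals 2 * ln(2) at the threshold
print(local_cross_entropy([0.0] * 10, label=3))  # equals ln(10) for uniform scores
```

Both functions consume quantities that a single layer can compute on its own, so either can be used without a global backward pass. Only the objective differs.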

If this is right

  • Mono-Forward (MF) outperforms vanilla Forward-Forward (FF) across MLPs and convolutional networks.
  • MF remains competitive with multiple FF variants.
  • On MLP-Mixers, MF achieves stronger results on PathMNIST than backpropagation while using 31 percent of the memory.
  • Local learning with a standard objective can achieve memory savings without sacrificing accuracy relative to global backpropagation.
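
The memory argument in the points above can be illustrated with a back-of-the-envelope count of stored activations. This is a simplified model under loud assumptions, not the paper's measurement: it counts activation floats only, ignores weights, optimizer state, and framework caching, and uses hypothetical layer widths. It does not attempt to reproduce the reported 31 percent figure.

```python
def bp_peak_activations(widths, batch):
    """Floats held at once under backprop: every layer's output, plus the input."""
    return batch * sum(widths)

def local_peak_activations(widths, batch, n_classes):
    """Floats held at once when each layer is updated right after its forward pass."""
    per_layer = [n_in + n_out + n_classes for n_in, n_out in zip(widths, widths[1:])]
    return batch * max(per_layer)

widths = [784, 512, 512, 512]  # hypothetical MLP: input size first, then hidden widths
bp = bp_peak_activations(widths, batch=128)
local = local_peak_activations(widths, batch=128, n_classes=10)
print(bp, local, round(local / bp, 3))
```

In this toy model the backpropagation count grows with depth, while the local count is bounded by the widest single layer.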

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern suggests local objectives could scale to deeper networks where backpropagation memory scales poorly.
  • Similar decompositions might isolate objective effects in other local-learning algorithms.
  • Memory reductions could enable larger models or batches on fixed hardware for medical imaging tasks.
  • The result raises the possibility that contrastive objectives are not required for effective layer-wise training.

Load-bearing premise

The decomposition cleanly separates the effects of locality from the effects of the goodness objective, allowing attribution of performance differences primarily to the objective.

What would settle it

Replicating the MLP-Mixer PathMNIST experiment and finding that Mono-Forward either no longer exceeds backpropagation's accuracy or requires more than 31 percent of backpropagation's memory would falsify the central claim.

Figures

Figures reproduced from arXiv: 2501.09238 by Bruce Li, James Gong, Waleed Abdulla.

Figure 1: Memory Consumed during Training under BP. This experiment utilizes MNIST dataset on a network of size 5
Figure 2: Memory Consumed during Training under MF. This experiment utilizes MNIST dataset on a network of size
Figure 3: Memory Comparison between Backpropagation (BP) and Mono-Forward (MF) during Training. For the MLP
Figure 4: Comparison of Convergence Rates between MF and BP. The experiments were conducted using a MLP
Figure 5: Comparison of Convergence Rates between MF and BP. The experiments were conducted using a MLP
Original abstract

Backpropagation remains the dominant algorithm for training deep neural networks, but it incurs substantial memory overhead and relies on global error propagation, which is often regarded as biologically implausible. The Forward-Forward (FF) algorithm is an appealing local-learning alternative to backpropagation, yet it still lags behind backpropagation in accuracy. A central unresolved question is whether this gap arises from FF's locality or from the positive-negative double-pass goodness objective used to train each layer. In this work, we revisit FF under the supervised setting through a decomposition that separates these two design choices. Our analysis suggests that FF's performance limitations are not explained by locality alone, but are also likely influenced by its goodness objective. Motivated by this view, we introduce Mono-Forward (MF), a simplification of FF that preserves its locality while replacing the contrastive goodness objective with a standard multi-class cross-entropy objective applied locally at each layer, serving as a controlled baseline for evaluating local learning under a standard classification objective. Across MLPs and convolutional networks, MF outperforms vanilla FF and remains competitive in multiple FF variants. On MLP-Mixers, MF achieves stronger results on PathMNIST than backpropagation while requiring only 31% of backpropagation's memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper decomposes Forward-Forward (FF) into locality and objective components, introducing Mono-Forward (MF) that preserves per-layer locality but replaces FF's contrastive goodness objective with local multi-class cross-entropy. Empirical comparisons on MLPs, CNNs, and MLP-Mixers show MF outperforming vanilla FF, remaining competitive with backpropagation, and exceeding backprop accuracy on PathMNIST with MLP-Mixers while using 31% of its memory.

Significance. If the decomposition is shown to be unconfounded, the result indicates that FF's performance gap relative to backprop arises in part from the goodness objective rather than locality alone, supplying a simpler local baseline that retains memory advantages. The concrete numbers on standard datasets and the memory reduction on MLP-Mixers would be useful for research on memory-efficient and biologically plausible training methods.

major comments (1)
  1. [Abstract] The claim that MF serves as a controlled baseline isolating the goodness objective requires an explicit demonstration that its local cross-entropy uses the same per-layer label prediction heads, information pathways, and auxiliary components as FF's goodness function; any mismatch (e.g., explicit classifier heads) would confound attribution of performance differences to the objective type.
minor comments (2)
  1. The experimental protocol, error bars, statistical tests, and exact data splits supporting the reported accuracy and memory figures are not visible in the provided text; these details should be added for reproducibility.
  2. Notation for the local cross-entropy loss and its per-layer application should be defined with an equation in the method section to allow direct comparison with FF's goodness function.
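
As a reference point for the second minor comment, the two per-layer objectives might be written as follows. The notation is an assumption made for illustration and is not taken from the paper: $\phi_i$ denotes the parameters of layer $i$, $\mathrm{sg}(\cdot)$ a stop-gradient, $\theta$ the goodness threshold, $(V_i, c_i)$ a local linear head, $K$ the number of classes, and $y$ the label. The goodness form follows the formulation usually attributed to the original Forward-Forward paper.

```latex
\begin{align*}
h_i &= f_i\big(\mathrm{sg}(h_{i-1});\, \phi_i\big), \qquad h_0 = x, \quad i = 1, \dots, L \\
g(h) &= \sum_j h_j^2 \\
\mathcal{L}_i^{\mathrm{FF}} &= \log\!\big(1 + e^{\,\theta - g(h_i^{+})}\big)
                              + \log\!\big(1 + e^{\,g(h_i^{-}) - \theta}\big) \\
\mathcal{L}_i^{\mathrm{MF}} &= -\log \frac{\exp\big((V_i h_i + c_i)_y\big)}
                                          {\sum_{k=1}^{K} \exp\big((V_i h_i + c_i)_k\big)}
\end{align*}
```

Here $h_i^{+}$ and $h_i^{-}$ are the layer's activations on positive and negative inputs, which is why the FF objective needs two forward passes per update and the MF objective needs one.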

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback and the opportunity to clarify the controlled nature of the Mono-Forward baseline. We address the single major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that MF serves as a controlled baseline isolating the goodness objective requires an explicit demonstration that its local cross-entropy uses the same per-layer label prediction heads, information pathways, and auxiliary components as FF's goodness function; any mismatch (e.g., explicit classifier heads) would confound attribution of performance differences to the objective type.

    Authors: We agree that the abstract claim requires stronger grounding. In the full manuscript (Section 3.2 and Algorithm 1), Mono-Forward is constructed by retaining the identical per-layer architecture, label-prediction heads (a linear layer followed by softmax), information pathways (forward-only local computation with no backward pass), and auxiliary components (normalization and stopping criteria) used by the goodness function in the original FF implementation. The sole change is the replacement of the contrastive positive/negative goodness loss with standard multi-class cross-entropy on the local predictions. We will revise the abstract to explicitly state this equivalence and add a short paragraph in Section 3 confirming that all non-objective elements are held fixed, thereby isolating the objective change.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on independent empirical comparisons

full rationale

The paper proposes MF by replacing FF's goodness objective with local multi-class cross-entropy while preserving locality, then reports direct accuracy and memory measurements on standard datasets (MLPs, conv nets, MLP-Mixers on PathMNIST). No equations, fitted parameters, or self-citations are used to derive performance claims; the central conclusion follows from explicit side-by-side runs rather than any reduction of a prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that local cross-entropy constitutes a fair and separable test of locality versus objective choice, plus standard supervised-learning assumptions about dataset validity and optimization convergence. No new free parameters or invented entities are introduced beyond ordinary training hyperparameters.

axioms (1)
  • Domain assumption: the effects of locality and of the goodness objective can be cleanly separated by the proposed decomposition.
    This separation underpins the conclusion that the goodness objective, and not locality alone, contributes to Forward-Forward's limitations.

pith-pipeline@v0.9.0 · 5749 in / 1349 out tokens · 62669 ms · 2026-05-23T05:52:10.051151+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hyperspherical Forward-Forward with Prototypical Representations

    cs.LG 2026-04 unverdicted novelty 7.0

    HFF replaces binary goodness-of-fit in Forward-Forward with hyperspherical prototypes for direct multi-class decisions, enabling single-forward-pass inference and training that scales to ImageNet while closing much of...
