Mono-Forward: Revisiting Forward-Forward through Objective-Locality Decomposition
Pith reviewed 2026-05-23 05:52 UTC · model grok-4.3
The pith
Forward-Forward's accuracy gap stems in part from its goodness objective rather than from locality alone; replacing that objective with a local cross-entropy loss closes much of the gap to backpropagation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Forward-Forward's performance limitations arise not only from locality but also from its positive-negative goodness objective; Mono-Forward, by applying a standard classification objective locally at each layer, preserves locality while delivering stronger results than Forward-Forward and competitive or superior results to backpropagation with reduced memory.
What carries the argument
The objective-locality decomposition that isolates the contribution of the goodness function from layer-wise forward computation, motivating replacement by local cross-entropy.
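To make the mechanism concrete, here is a minimal pure-Python sketch of layer-local cross-entropy training, following the form quoted further down this page (scores G = a M^T per layer, then a softmax cross-entropy on those scores). It is an illustration under stated assumptions, not the authors' implementation: the names `LocalLayer`, `local_step`, and `train_sample`, the ReLU activation, the zero-initialised projection matrix, the plain SGD update, and the toy data are all choices made for this sketch.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class LocalLayer:
    """One ReLU layer W with its own projection matrix M (illustrative names)."""

    def __init__(self, n_in, n_out, n_classes, rng):
        self.W = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        # Zero-initialised projection: an assumption of this sketch, not taken from the paper.
        self.M = [[0.0] * n_out for _ in range(n_classes)]

    def forward(self, x):
        return [max(0.0, p) for p in matvec(self.W, x)]

    def local_step(self, x, y_onehot, lr):
        """Update only this layer from its own cross-entropy loss; return (loss, activations)."""
        a = self.forward(x)
        p = softmax(matvec(self.M, a))  # class probabilities from G = a M^T
        loss = -sum(y * math.log(pc + 1e-12) for y, pc in zip(y_onehot, p))
        dG = [pc - y for pc, y in zip(p, y_onehot)]  # dL/dG for softmax + cross-entropy
        # Gradient w.r.t. the activations, computed before M is modified.
        da = [sum(self.M[c][j] * dG[c] for c in range(len(dG))) for j in range(len(a))]
        for c in range(len(dG)):
            for j in range(len(a)):
                self.M[c][j] -= lr * dG[c] * a[j]
        for j in range(len(a)):
            if a[j] > 0.0:  # ReLU passes gradient only where the unit is active
                for k in range(len(x)):
                    self.W[j][k] -= lr * da[j] * x[k]
        return loss, a


def train_sample(layers, x, y_onehot, lr=0.1):
    """One forward sweep: each layer trains locally, then hands its activations onward."""
    losses = []
    h = x
    for layer in layers:
        # h is treated as a constant input by the next layer: no gradient crosses layers.
        loss, h = layer.local_step(h, y_onehot, lr)
        losses.append(loss)
    return losses


rng = random.Random(0)
layers = [LocalLayer(2, 8, 2, rng), LocalLayer(8, 8, 2, rng)]
data = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]  # toy two-class data
for epoch in range(50):
    epoch_losses = [train_sample(layers, x, y) for x, y in data]
print("per-layer losses on the last sample:", epoch_losses[-1])
```

The point of the sketch is the data flow: every layer computes a loss and an update from its own input and the label, and nothing is propagated backward across layers. Whether the paper passes pre-update or post-update activations to the next layer is not stated on this page; the sketch passes the pre-update ones.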
If this is right
- MF outperforms vanilla FF across MLPs and convolutional networks.
- MF remains competitive with multiple FF variants.
- On MLP-Mixers, MF achieves stronger results on PathMNIST than backpropagation while using 31 percent of the memory.
- Local learning with a standard objective can achieve memory savings without sacrificing accuracy relative to global backpropagation.
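A rough way to see why layer-local training can save memory is to count stored activations. The sketch below is a back-of-envelope model, not a reproduction of the paper's measurement. It assumes backpropagation keeps every layer's activations until the backward pass, while a layer-local method needs only the input and output of the layer being updated. The layer sizes are hypothetical, and parameters, optimizer state, projection matrices, and framework overhead are ignored, so the ratio it prints should not be expected to match the reported 31 percent.

```python
def activation_floats_backprop(batch, sizes):
    """Floats held when the input and every layer's activations are kept for the backward pass."""
    return batch * sum(sizes)


def activation_floats_local(batch, sizes):
    """Peak floats held when only one layer's input and output are alive at a time."""
    return batch * max(a + b for a, b in zip(sizes, sizes[1:]))


# Hypothetical network: input width followed by four hidden widths.
sizes = [512, 512, 512, 512, 512]
batch = 32

bp = activation_floats_backprop(batch, sizes)
local = activation_floats_local(batch, sizes)
ratio = local / bp
print(f"local / backprop activation storage: {ratio:.2f}")
```

Under this simplified model the gap widens with depth, since the backpropagation count grows with the number of layers while the local count depends only on the widest adjacent pair.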
Where Pith is reading between the lines
- The pattern suggests local objectives could scale to deeper networks where backpropagation memory scales poorly.
- Similar decompositions might isolate objective effects in other local-learning algorithms.
- Memory reductions could enable larger models or batches on fixed hardware for medical imaging tasks.
- The result raises the possibility that contrastive objectives are not required for effective layer-wise training.
Load-bearing premise
The decomposition cleanly separates the effects of locality from the effects of the goodness objective, allowing attribution of performance differences primarily to the objective.
What would settle it
Replicating the MLP-Mixer PathMNIST experiment and finding that Mono-Forward either no longer exceeds backpropagation's accuracy or requires more than 31 percent of its memory would falsify the central claim.
Original abstract
Backpropagation remains the dominant algorithm for training deep neural networks, but it incurs substantial memory overhead and relies on global error propagation, which is often regarded as biologically implausible. The Forward-Forward (FF) algorithm is an appealing local-learning alternative to backpropagation, yet it still lags behind backpropagation in accuracy. A central unresolved question is whether this gap arises from FF's locality or from the positive-negative double-pass goodness objective used to train each layer. In this work, we revisit FF under the supervised setting through a decomposition that separates these two design choices. Our analysis suggests that FF's performance limitations are not explained by locality alone, but are also likely influenced by its goodness objective. Motivated by this view, we introduce Mono-Forward (MF), a simplification of FF that preserves its locality while replacing the contrastive goodness objective with a standard multi-class cross-entropy objective applied locally at each layer, serving as a controlled baseline for evaluating local learning under a standard classification objective. Across MLPs and convolutional networks, MF outperforms vanilla FF and remains competitive in multiple FF variants. On MLP-Mixers, MF achieves stronger results on PathMNIST than backpropagation while requiring only 31% of backpropagation's memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper decomposes Forward-Forward (FF) into locality and objective components, introducing Mono-Forward (MF) that preserves per-layer locality but replaces FF's contrastive goodness objective with local multi-class cross-entropy. Empirical comparisons on MLPs, CNNs, and MLP-Mixers show MF outperforming vanilla FF, remaining competitive with backpropagation, and exceeding backprop accuracy on PathMNIST with MLP-Mixers while using 31% of its memory.
Significance. If the decomposition is shown to be unconfounded, the result indicates that FF's performance gap relative to backprop arises in part from the goodness objective rather than locality alone, supplying a simpler local baseline that retains memory advantages. The concrete numbers on standard datasets and the memory reduction on MLP-Mixers would be useful for research on memory-efficient and biologically plausible training methods.
major comments (1)
- [Abstract] The claim that MF serves as a controlled baseline isolating the goodness objective requires explicit demonstration that its local cross-entropy uses identical per-layer label prediction heads, information pathways, and auxiliary components as FF's goodness function; any mismatch (e.g., explicit classifier heads) would confound attribution of performance differences to the objective type.
minor comments (2)
- The experimental protocol, error bars, statistical tests, and exact data splits supporting the reported accuracy and memory figures are not visible in the provided text; these details should be added for reproducibility.
- Notation for the local cross-entropy loss and its per-layer application should be defined with an equation in the method section to allow direct comparison with FF's goodness function.
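As a sketch of what such an equation could look like, the passage quoted under "Lean theorems connected to this paper" suggests the following per-layer form, where a_i is the activation of layer i, M_i is a per-layer projection matrix, y is the one-hot label, and σ is read here as softmax (an interpretation based on the authors' description of a linear layer followed by softmax). For comparison, the second pair of lines gives the goodness function from Hinton's original Forward-Forward formulation, with s the logistic function and θ a threshold; the symbols g_i, s, and the index j are notation chosen for this sketch.

```latex
% Mono-Forward, layer i (reconstructed from the passage quoted on this page)
G_i \triangleq a_i M_i^{\top}, \qquad
L_i \triangleq -\sum_{c} y_c \log \sigma(G_i)_c

% Forward-Forward, layer i (Hinton's goodness: sum of squared activities)
g_i = \sum_{j} a_{i,j}^{2}, \qquad
p(\text{positive}) = s\!\left(g_i - \theta\right)
```

Written side by side, the two differ in what the layer is asked to predict: a distribution over classes from a single pass, versus a binary positive/negative decision that needs separate positive and negative passes.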
Simulated Author's Rebuttal
We thank the referee for the detailed feedback and the opportunity to clarify the controlled nature of the Mono-Forward baseline. We address the single major comment below and will revise the manuscript accordingly.
- Referee: [Abstract] The claim that MF serves as a controlled baseline isolating the goodness objective requires explicit demonstration that its local cross-entropy uses identical per-layer label prediction heads, information pathways, and auxiliary components as FF's goodness function; any mismatch (e.g., explicit classifier heads) would confound attribution of performance differences to the objective type.
  Authors: We agree that the abstract claim requires stronger grounding. In the full manuscript (Section 3.2 and Algorithm 1), Mono-Forward is constructed by retaining the identical per-layer architecture, label-prediction heads (a linear layer followed by softmax), information pathways (forward-only local computation with no backward pass), and auxiliary components (normalization and stopping criteria) used by the goodness function in the original FF implementation. The sole change is the replacement of the contrastive positive/negative goodness loss with standard multi-class cross-entropy on the local predictions. We will revise the abstract to explicitly state this equivalence and add a short paragraph in Section 3 confirming that all non-objective elements are held fixed, thereby isolating the objective change.
  Revision: yes
Circularity Check
No circularity; claims rest on independent empirical comparisons
The paper proposes MF by replacing FF's goodness objective with local multi-class cross-entropy while preserving locality, then reports direct accuracy and memory measurements on standard datasets (MLPs, conv nets, MLP-Mixers on PathMNIST). No equations, fitted parameters, or self-citations are used to derive performance claims; the central conclusion follows from explicit side-by-side runs rather than any reduction of a prediction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The effects of locality and of the goodness objective can be cleanly separated by the proposed decomposition.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "MF employs a specialized approach to calculate the 'goodness' score at each layer... $G_i \triangleq a_i M_i^{\top}$ ... $L_i \triangleq -\sum_{c} y_c \log(\sigma(G_{i,c}))$"
- IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "MF consistently matches or surpasses the accuracy of backpropagation... with significantly reduced memory consumption"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Hyperspherical Forward-Forward with Prototypical Representations
  HFF replaces binary goodness-of-fit in Forward-Forward with hyperspherical prototypes for direct multi-class decisions, enabling single-forward-pass inference and training that scales to ImageNet while closing much of...