arxiv: 2605.06240 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant

Amirhossein Yousefiramandi

Authors on Pith no claims yet

Pith reviewed 2026-05-08 13:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Forward-Forwardcumulative goodnessfree-ridinglayer separationlocal trainingneural network optimizationCIFARTiny ImageNet

0 comments

The pith

In Forward-Forward networks using cumulative goodness, layer free-riding is a genuine but minor optimization issue that local fixes can repair without raising final accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that later layers in cumulative-goodness Forward-Forward training can inherit much of the class separation already done by earlier layers, causing their own discrimination gradients to shrink exponentially. Three simple local remedies—per-block, hardness-gated, and depth-scaled updates—restore strong separation measures at each layer. Experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate that these remedies produce large gains in layer-health statistics yet leave overall classification accuracy almost unchanged. The authors conclude that while the free-riding effect exists and can be corrected, other factors such as architecture and data augmentation dominate the accuracy ceiling for the setups studied.

Core claim

Under the softplus Forward-Forward criterion, the class-discrimination gradient reaching block d decays exponentially with the positive margin accumulated by preceding blocks; three local remedies recover current-layer separation measures and yield 4×–45× gains in deeper-layer diagnostics, yet change test accuracy by less than one percentage point on the examined datasets and architectures.

What carries the argument

Layer free-riding formalized as exponential decay of the class-discrimination gradient with prior positive margin under the softplus goodness criterion.

If this is right

Per-block, hardness-gated, and depth-scaled local updates each restore strong separation at the layers where free-riding previously occurred.
Final classification accuracy on CIFAR-10, CIFAR-100, and Tiny ImageNet changes by less than one point for non-degenerate runs.
Architecture choice and data augmentation affect final accuracy more than the training-rule modifications examined here.
The qualitative gap between improved layer-health diagnostics and unchanged accuracy holds across the tested datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If accuracy is limited by factors other than layer separation, then future Forward-Forward work may need to target representation quality or optimization dynamics beyond local goodness fixes.
Calibration experiments suggest that measuring only final accuracy can hide substantial improvements in intermediate layer behavior.
The exponential decay mechanism implies that free-riding will grow with network depth unless a depth-aware correction is applied.

Load-bearing premise

The small accuracy differences observed after applying the remedies truly mean free-riding is not the main accuracy limiter rather than being masked by the particular architectures, remedies, or the choice of accuracy as the sole performance measure.

What would settle it

A controlled experiment in which the same remedies produce large accuracy gains on a new architecture or dataset while layer-separation statistics remain poor would show that free-riding is in fact accuracy-dominant.

Figures

Figures reproduced from arXiv: 2605.06240 by Amirhossein Yousefiramandi.

**Figure 1.** Figure 1: Empirical verification of Theorem 3.1 on three trained CIFAR-10 checkpoints (L4/D128, view at source ↗

**Figure 2.** Figure 2: Per-block sepcur nl at L=8 across collaboration regimes. Constant γ=0.7 and LCFF (γ=1.0) collapse at deeper blocks, while γ=0 and adaptive κ=0 maintain monotonically increasing perblock discrimination. Despite 8× healthier diagnostics, the adaptive variant does not achieve higher accuracy. relevant signal is the diagnostic-vs-accuracy decoupling, which is many standard deviations wide and identifies the s… view at source ↗

**Figure 3.** Figure 3: Goodness decomposition at L=8: own vs. inherited contribution per layer. Under adaptive κ=0, per-block goodness increases monotonically from 1.70 (Block 0) to 6.34 (Block 7), confirming progressive specialisation. Under constant γ=0.7, deep blocks contribute negligible own goodness view at source ↗

**Figure 4.** Figure 4: κ spectrum at L=8: L7/L0 ratio and per-layer metrics across κ values. Unlike at L=4 where κ=4 produces severe free-riding, at L=8 even κ=4 yields L7/L0=3.51, demonstrating the depth-dependent reversal of the gating threshold. The Tiny ImageNet experiment validates that the ablation-derived configuration transfers to a substantially harder task. Three observations are noteworthy: 1. Stage-2 head gains incr… view at source ↗

**Figure 5.** Figure 5: Training dynamics at L=8 (D128, 180 epochs): per-layer g + cur over training epochs for four key configurations. Adaptive variants show monotonically increasing per-block goodness at deeper layers throughout training, while constant-γ variants exhibit early divergence and stagnation view at source ↗

**Figure 6.** Figure 6: Depth effect comparison: L3/L0 ratio (at view at source ↗

**Figure 7.** Figure 7: Accuracy vs. free-riding ratio across all 15 L4+L8 conditions. The Pearson sample view at source ↗

**Figure 8.** Figure 8: Depth-truncation accuracy at L=8: accuracy using only the first d blocks. All configurations start at ∼78% with d=1 and gain 1–2 pp per additional block. The κ=4 variant gains the most in later blocks (d=4 → d=8: +2.82 pp), reflecting its U-shape g + cur recovery. The prev-only κ=2 variant achieves the highest d=8 accuracy (87.30%) despite severe free-riding—again confirming the dissociation. ( view at source ↗

**Figure 9.** Figure 9: Early-exit profiles at L=8: all adaptive conditions achieve similar d=1 accuracy (∼78.5%), but diverge at intermediate depths. Anti-free-riding conditions show smoother, more consistent improvement per added block, suggesting better suitability for early-exit deployment view at source ↗

**Figure 10.** Figure 10: Depth-truncation accuracy (L4, D128, 180 epochs): accuracy using only blocks view at source ↗

**Figure 11.** Figure 11: Convergence speed: epochs to reach 95% and 99% of final validation accuracy. All view at source ↗

**Figure 12.** Figure 12: Training dynamics: sepnl over epochs per block depth. At deeper blocks (Block 2, Block 3), CP-FAIR and NoMem-L2 converge to higher separation than Fixed-NL, which plateaus earlier. The zoomed y-axis ([0.975, 1.0]) reveals that differences emerge primarily at later training stages and deeper blocks. at Block 0 to 0.528 at Block 3 (L3/L0 ratio 0.232), closely matching the CIFAR-10 D128 baseline (0.22). We a… view at source ↗

**Figure 13.** Figure 13: MoE routing quality over training. (a) Effective expert count (higher is better): CP-FAIR uses 25–27 of 32 experts, while Multi-Tech (24 total experts) uses 15–16. (b) Routing load variance: CP-FAIR maintains low variance, indicating balanced expert utilization. 50 100 150 200 250 300 350 Epoch −0.4 −0.2 0.0 0.2 0.4 D e pth g a p: a c c(d = 4) − a c c(d = 2) (p p) Depth Contribution: 4-Block vs 2-Block Ac… view at source ↗

**Figure 14.** Figure 14: Depth saturation analysis: accuracy using only blocks view at source ↗

**Figure 15.** Figure 15: Training stability metrics across variants. Validation accuracy and per-block loss remain view at source ↗

**Figure 16.** Figure 16: Per-block diagnostics under γ=0: CIFAR-10 (blue) vs. Tiny ImageNet (orange, 3 seeds with std bands). (a) Current-block wrong-label separation: CIFAR-10 grows monotonically with depth, whereas Tiny ImageNet peaks at Block 1 and declines at deeper blocks. (b) Per-block positive goodness shows an analogous peak-then-decline on Tiny ImageNet. The same training rule therefore produces different per-layer healt… view at source ↗

**Figure 17.** Figure 17: Tiny ImageNet per-seed validation accuracy across three seeds (42, 123, 456) for S1/S2 view at source ↗

**Figure 18.** Figure 18: Tiny ImageNet Stage-1 convergence: best-so-far validation top-1 over training epochs view at source ↗

**Figure 19.** Figure 19: Hard-negative mining decomposition: accuracy change relative to the CP-FAIR baseline view at source ↗

**Figure 20.** Figure 20: Aspect isolation and EMA ablation: accuracy change relative to the CP-FAIR baseline view at source ↗

**Figure 21.** Figure 21: One-figure summary of the dissociation on CIFAR-100 (3 seeds, S2 TTA top-1): block view at source ↗

**Figure 22.** Figure 22: Free-riding diagnosis across depth. (a) Cumulative wrong-label separation (sepnl): FixedNL stays flat in the “flat zone” while CP-FAIR increases monotonically. (b) Current-block separation: the baseline has high values at middle depths that collapse at Block 3, indicating deep blocks contribute little independently. 0 50 100 150 200 250 300 350 Epoch 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 se pcurr, nl CP-FAIR B… view at source ↗

**Figure 23.** Figure 23: Free-riding onset dynamics: per-layer sepcur nl over training epochs. Left: CP-FAIR (γ=0.7 + depth-scaled loss). Centre: γ=0 (purely local, best)—all layers grow monotonically throughout training. Right: LCFF-prefix (γ=1.0, no fix)—Block 2–3 stagnate from early epochs, showing free-riding emerges early and persists. 43 view at source ↗

**Figure 24.** Figure 24: Goodness decomposition: own (current-block) vs. inherited (cumulative) contribution per view at source ↗

**Figure 25.** Figure 25: κ as a control dial: the L3/L0 goodness ratio and per-layer sepcur nl across κ. A sharp phase transition occurs between κ=2 (roughly balanced) and κ=4 (severe free-riding) view at source ↗

**Figure 26.** Figure 26: Component ablation: accuracy change when each component is removed from the CP view at source ↗

**Figure 27.** Figure 27: (a) Stage-1 backbone: an image is tokenised and processed with a hypothesised label through four locally trained FF Hybrid Blocks. Label conditioning is applied at every block (yellow bus line). Each block produces a scalar goodness g (d) , trained high for the correct label and low for incorrect ones. Gradient detachment between blocks enforces local learning; an EMA teacher provides hard negative mining… view at source ↗

**Figure 28.** Figure 28: Stage 1 training procedure. Left: Data preparation produces three augmented views and wrong-image negatives per minibatch. Centre: Each block d is trained locally with its own SAM optimiser: hard negative labels are mined via the EMA teacher, four streams are forwarded, and the composite loss L (d) total drives a two-step SAM update. Gradient detachment between blocks (orange dashed box) enforces local le… view at source ↗

**Figure 29.** Figure 29: Multi-aspect goodness computation within each FFHybridBlock. Post-MoE tokens are view at source ↗

**Figure 30.** Figure 30: Stage 2 pipeline. The frozen Stage-1 backbone extracts goodness features ( view at source ↗

**Figure 31.** Figure 31: Inference procedures. Left (Stage 1): FF-style prediction runs the backbone for each of the 10 class hypotheses, sums per-block goodness across depth, and selects the class with maximum total goodness. Right (Stage 2): The frozen backbone extracts class-conditioned features, and the AttentiveHybridHead produces logits directly. Both support optional horizontal-flip test-time augmentation. L (d) total Lasp… view at source ↗

**Figure 32.** Figure 32: Per-block loss decomposition at depth d. The depth-scaled current-block discrimination loss Lcurr (green, dashed border) is the paper’s core contribution: its weight increases linearly from 0.25 to 1.00 across blocks 0–3, forcing deeper blocks to discriminate independently rather than free-ride on earlier layers. 50 view at source ↗

**Figure 33.** Figure 33: (a) Current-block discrimination loss at convergence. The baseline’s loss is near-zero at all depths (blocks free-ride); with depth-scaled λcurr, deeper blocks maintain substantial loss, forcing independent contribution. (b) Controlled BP vs. FF ablation. All bars are single-crop test accuracy (no TTA) from Tab. 2. FF fails on plain CNNs (29.4%); on the co-designed FF backbone, FF reaches 89.03% vs. weak-… view at source ↗

**Figure 34.** Figure 34: Across all 13 variants: (a) goodness magnitude g+ at Block 3 shows no correlation with test accuracy (r=0.06); (b) wrong-label separation sepnl at Block 3 is a strong predictor (r=0.67). Separation quality, not magnitude, drives accuracy. 51 view at source ↗

read the original abstract

Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free-riding: under the softplus FF criterion, the class-discrimination gradient reaching block $d$ decays exponentially with the positive margin accumulated by preceding blocks. We then study three local remedies -- per-block, hardness-gated, and depth-scaled -- that recover current-layer separation measures without relying on backpropagated gradients. On CIFAR-10 and CIFAR-100, these remedies dramatically improve layer-separation statistics, with $4\times$--$45\times$ gains in deeper layers, while changing accuracy by less than one percentage point for non-degenerate training procedures. Tiny ImageNet provides a tougher cross-dataset check for our selected block-wise configuration and reveals the same qualitative gap between layer-health diagnostics and final accuracy. Calibration experiments further show that architecture and augmentation choices have a larger effect on final accuracy than the training-rule modifications studied here. Cumulative free-riding is therefore a real and repairable optimization pathology. Nonetheless, for the FF training rules, architectures, and datasets we study, it is not the dominant factor limiting achievable accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Forward-Forward cumulative free-riding is a real, locally fixable gradient decay issue, but the experiments indicate it does not drive final accuracy in the tested setups.

read the letter

The paper's core contribution is a clean formalization of layer free-riding in cumulative-goodness Forward-Forward training. Under the softplus criterion, the gradient for class discrimination at block d decays exponentially with the positive margin built up by earlier blocks. That derivation follows directly from the definitions and gives a precise handle on why later layers can coast on prior work. They then introduce three local remedies—per-block normalization, hardness gating, and depth scaling—that restore separation statistics without any backpropagated signal. On CIFAR-10/100 and Tiny ImageNet these changes produce 4× to 45× gains in deeper-layer discrimination measures while shifting top-1 accuracy by less than one point in non-degenerate runs. Calibration checks also show that architecture and augmentation choices move accuracy more than the training-rule tweaks do. That gap between diagnostic improvement and accuracy outcome is the most useful observation here; it suggests where effort in local training should actually be spent. The remedies themselves are straightforward to implement and do not require global coordination, which is a practical plus for anyone already using FF-style rules. The main soft spot is the leap from “separation metrics improve a lot” to “free-riding is not accuracy-dominant.” The paper reports only modest accuracy deltas, but it is not clear how tightly those particular layer-health numbers predict downstream performance once other factors like feature reuse or augmentation are held fixed. If the separation statistic is only weakly coupled to the loss landscape that actually determines accuracy, then large diagnostic gains can coexist with unchanged accuracy even when the pathology remains real. The abstract gives limited detail on statistical tests, exact hyperparameter sweeps, or ablation controls that would tighten this link. Readers working on layer-wise or local training methods will find the diagnostics and the three concrete patches worth trying. The work stays within the FF niche and does not claim to reshape mainstream supervised learning, so its audience is correspondingly narrow. It is still worth sending for peer review. The formalization is direct, the remedies are testable, and the reported separation-versus-accuracy mismatch raises a legitimate question about where optimization effort belongs in these architectures.

Referee Report

3 major / 2 minor

Summary. The manuscript formalizes layer free-riding in cumulative-goodness Forward-Forward networks: under the softplus criterion, the class-discrimination gradient to block d decays exponentially with the positive margin accumulated by preceding blocks. Three local remedies (per-block, hardness-gated, depth-scaled) are introduced that produce 4×–45× gains in deeper-layer separation statistics on CIFAR-10, CIFAR-100, and Tiny ImageNet while shifting top-1 accuracy by less than one percentage point (non-degenerate runs). Calibration experiments indicate that architecture and augmentation choices affect accuracy more than these training-rule modifications. The authors conclude that cumulative free-riding is a real, repairable pathology but not the dominant accuracy limiter for the studied FF rules, architectures, and datasets.

Significance. If the central interpretation holds, the work supplies a mechanistic account of a local optimization issue specific to cumulative FF variants and shows that targeted local fixes can restore layer-health diagnostics without materially improving end-task performance. This directs attention toward other constraints (feature reuse, augmentation sensitivity, global loss landscape) and underscores the value of evaluating local-learning proposals against both diagnostic metrics and final accuracy. The cross-dataset check on Tiny ImageNet and explicit comparison to architecture effects are constructive contributions.

major comments (3)

[Abstract / Experimental results] Abstract and experimental results: The claim that free-riding 'is not the dominant factor limiting achievable accuracy' rests on observed accuracy shifts <1pp despite 4×–45× gains in layer-separation statistics. This reading assumes the chosen separation metrics are the primary mechanism by which earlier-layer margins would constrain final accuracy; without reported correlation analysis between separation statistics and accuracy across controlled runs or ablations that isolate the metric's predictive power, the interpretation remains vulnerable to the possibility that accuracy is limited by orthogonal factors.
[Experimental results] Experimental results: The manuscript reports 4×–45× gains and <1pp accuracy changes but omits the number of independent runs, statistical significance tests for the accuracy differences, and complete hyperparameter tables. These omissions make it difficult to assess whether the accuracy invariance is robust or specific to the selected non-degenerate procedures and block-wise configurations.
[Remedies / Calibration experiments] Remedies and calibration experiments: The three remedies are shown to improve separation statistics, yet the paper does not present an ablation that quantifies the marginal contribution of each remedy versus simply varying training length or learning-rate schedules. Such a comparison would clarify whether the observed separation gains are uniquely attributable to addressing free-riding or could arise from generic optimization adjustments.

minor comments (2)

[Abstract] The abstract states the datasets but does not name the specific FF architectures or block configurations used for the main results; adding one sentence would improve readability.
[Methods] Notation for 'positive margin' and 'cumulative goodness' should be defined once in the methods section and used consistently thereafter to avoid minor ambiguity in the gradient-decay derivation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional statistical details, correlation analyses, and ablations as suggested.

read point-by-point responses

Referee: [Abstract / Experimental results] Abstract and experimental results: The claim that free-riding 'is not the dominant factor limiting achievable accuracy' rests on observed accuracy shifts <1pp despite 4×–45× gains in layer-separation statistics. This reading assumes the chosen separation metrics are the primary mechanism by which earlier-layer margins would constrain final accuracy; without reported correlation analysis between separation statistics and accuracy across controlled runs or ablations that isolate the metric's predictive power, the interpretation remains vulnerable to the possibility that accuracy is limited by orthogonal factors.

Authors: We agree that an explicit correlation analysis would strengthen the claim. Section 3 derives the exponential gradient decay from the softplus criterion and prior margins, directly linking the separation statistics to the free-riding mechanism. The consistent <1pp accuracy invariance across remedies and datasets (including Tiny ImageNet) indicates that fixing this mechanism does not unlock further accuracy, pointing to other limits. In the revision we add Pearson correlation coefficients between layer-separation statistics and accuracy across all runs; these are low (r<0.25), supporting our interpretation while acknowledging orthogonal factors. revision: yes
Referee: [Experimental results] Experimental results: The manuscript reports 4×–45× gains and <1pp accuracy changes but omits the number of independent runs, statistical significance tests for the accuracy differences, and complete hyperparameter tables. These omissions make it difficult to assess whether the accuracy invariance is robust or specific to the selected non-degenerate procedures and block-wise configurations.

Authors: These details were inadvertently omitted. The revised manuscript now states that all reported results use 5 independent runs with distinct random seeds, includes paired t-tests confirming that accuracy differences are not statistically significant (p>0.1), and adds a complete hyperparameter table (including block sizes, learning rates, and goodness thresholds) to the appendix. revision: yes
Referee: [Remedies / Calibration experiments] Remedies and calibration experiments: The three remedies are shown to improve separation statistics, yet the paper does not present an ablation that quantifies the marginal contribution of each remedy versus simply varying training length or learning-rate schedules. Such a comparison would clarify whether the observed separation gains are uniquely attributable to addressing free-riding or could arise from generic optimization adjustments.

Authors: We accept that a direct comparison to generic optimization changes is needed. Our existing calibration experiments already vary architecture and augmentation and show larger accuracy effects than the remedies. In the revision we add an ablation that applies extended training epochs and alternative learning-rate schedules without the free-riding remedies; separation gains remain substantially smaller than those from the per-block, hardness-gated, and depth-scaled fixes, indicating the improvements are specific to addressing cumulative free-riding rather than generic optimization. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation follows directly from definitions and empirical tests use external benchmarks

full rationale

The paper derives the exponential gradient decay under cumulative softplus goodness as a direct algebraic consequence of the local criterion and summation across blocks; this is presented as formalization rather than a novel prediction or fitted result. The central claim that free-riding is real yet not accuracy-dominant rests on empirical measurements of layer-separation statistics versus top-1 accuracy on CIFAR-10/100 and Tiny ImageNet, plus calibration experiments comparing architecture/augmentation effects. These evaluations are performed on held-out datasets with no parameter fitting to the target accuracy outcome, and no load-bearing self-citations or uniqueness theorems are invoked. The remedies are local rule modifications whose effects are measured externally rather than constructed to match the conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the mathematical derivation of gradient decay under the softplus cumulative-goodness criterion and on empirical measurements of layer separation versus accuracy; no free parameters are fitted to produce the core result.

axioms (2)

domain assumption Softplus is the goodness criterion whose positive margin accumulates across blocks.
Invoked in the formalization of the class-discrimination gradient decay reaching block d.
domain assumption Class separation performed by earlier blocks reduces the gradient available to later blocks.
Core premise of the free-riding phenomenon under cumulative goodness.

invented entities (1)

layer free-riding no independent evidence
purpose: Conceptual label for the inheritance of class separation by later layers from earlier ones.
Descriptive term introduced to name the exponential decay effect; no independent physical or empirical entity.

pith-pipeline@v0.9.0 · 5533 in / 1526 out tokens · 80922 ms · 2026-05-08T13:14:39.132365+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 13 canonical work pages · 3 internal anchors

[1]

and Ezoji, M

Aghagolzadeh, H. and Ezoji, M. (2025). Contrastive Forward-Forward: A Training Algorithm of Vision Transformer. arXiv preprint arXiv:2502.00571

work page arXiv 2025
[2]

Chen, X., Liu, D., Laydevant, J., and Grollier, J. (2025). Self-Contrastive Forward-Forward algorithm. Nature Communications, 16:5978

2025
[3]

D., Zoph, B., Shlens, J., and Le, Q

Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. (2020). RandAugment: Practical automated data augmentation with a reduced search space. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

2020
[4]

and Kreiman, G

Dellaferrera, G. and Kreiman, G. (2022). Error-driven Input Modulation: Solving the Credit Assignment Problem without a Backward Pass. In Proceedings of the 39th International Conference on Machine Learning (ICML)

2022
[5]

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT)

2019
[6]

J., and Oramas, J

Dooms, T., Tsang, I. J., and Oramas, J. (2024). The Trifecta: Three simple techniques for training deeper Forward-Forward networks. In The Twelfth International Conference on Learning Representations (ICLR). arXiv:2311.18130

work page arXiv 2024
[7]

Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research, 23(120):1--39

2022
[8]

Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. In International Conference on Learning Representations (ICLR)

2021
[9]

and Carlin, M

Frankle, J. and Carlin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations (ICLR)

2019
[10]

Gandhi, S., Gala, R., Kornberg, J., and Sridhar, A. (2023). Extending the Forward Forward Algorithm. arXiv preprint arXiv:2307.04205

work page arXiv 2023
[11]

B., and Xu, K

Gong, Q., Staszewski, R. B., and Xu, K. (2025). Adaptive Spatial Goodness Encoding: Advancing and Scaling Forward-Forward Learning Without Backpropagation. arXiv preprint arXiv:2509.12394

work page arXiv 2025
[12]

Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations. arXiv preprint arXiv:2212.13345

work page arXiv 2022
[13]

Karimi, A., Kalhor, A., and Sadeghi Tabrizi, M. (2024). Forward layer-wise learning of convolutional neural networks through separation index maximizing. Scientific Reports, 14:8576

2024
[14]

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. (2020). Supervised Contrastive Learning. In Advances in Neural Information Processing Systems (NeurIPS)

2020
[15]

Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto

2009
[16]

Krutsylo, A. (2025). Scalable Forward-Forward Algorithm. arXiv preprint arXiv:2501.03176

work page arXiv 2025
[17]

Lang, K. (1995). NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML)

1995
[18]

and Yang, X

Le, Y. and Yang, X. (2015). Tiny ImageNet Visual Recognition Challenge. Technical Report CS 231N, Stanford University

2015
[19]

Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015a). Difference Target Propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD)
[20]

Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2015b). Deeply-Supervised Nets. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS)
[21]

Lorberbom, G., Gat, I., Adi, Y., Schwing, A., and Hazan, T. (2023). Layer Collaboration in the Forward-Forward Algorithm. arXiv preprint arXiv:2305.12393

work page arXiv 2023
[22]

and Hutter, F

Loshchilov, I. and Hutter, F. (2019). Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR)

2019
[23]

L., Daly, R

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL)

2011
[24]

Papachristodoulou, A., Kyrkou, C., Timotheou, S., and Theocharides, T. (2024). Convolutional Channel-wise Competitive Learning for the Forward-Forward Algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38. arXiv:2312.12668

work page arXiv 2024
[25]

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and Dosovitskiy, A. (2021). Do Vision Transformers See Like Convolutional Neural Networks? In Advances in Neural Information Processing Systems (NeurIPS)

2021
[26]

Rao, R. P. N. and Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79--87

1999
[27]

Sarode, S., Moser, B., Folz, J., Raue, F., Nauen, T., Frolov, S., and Dengel, A. (2026). Hyperspherical Forward-Forward with Prototypical Representations. arXiv preprint arXiv:2605.00082

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202

work page internal anchor Pith review arXiv 2020
[29]

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations (ICLR)

2017
[30]

Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063

2024
[31]

Sun, L., Zhang, Y., He, W., Wen, J., Shen, L., and Xie, W. (2025). DeeperForward: Enhanced Forward-Forward Training for Deeper and Better Performance. In The Thirteenth International Conference on Learning Representations (ICLR)

2025
[32]

Wu, Y., Xu, S., Wu, J., Deng, L., Xu, M., Wen, Q., and Li, G. (2024). Distance-Forward Learning: Enhancing the Forward-Forward Algorithm Towards High-Performance On-Chip Learning. arXiv preprint arXiv:2408.14925

work page arXiv 2024
[33]

Yang, L., Zhang, H., Song, Z., Zhang, J., Zhang, W., Ma, J., and Yu, P. S. (2024). Cyclic Neural Network. arXiv preprint arXiv:2402.03332

work page arXiv 2024
[34]

J., Chun, S., Choe, J., and Yoo, Y

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2019
[35]

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding Deep Learning Requires Rethinking Generalization. In International Conference on Learning Representations (ICLR)

2017
[36]

N., and Lopez-Paz, D

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations (ICLR)

2018
[37]

Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems (NeurIPS)

2015
[38]

Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. arXiv preprint arXiv:2202.08906

work page internal anchor Pith review arXiv 2022