A Unified Framework of Online Learning Algorithms for Training Recurrent Neural Networks

Cristina Savin; Kyunghyun Cho; Owen Marschall

arXiv: 1907.02649 · v1 · pith:4DSUAFA2new · submitted 2019-07-05 · 💻 cs.LG · cs.NE · q-bio.NC · stat.ML

Owen Marschall, Kyunghyun Cho, Cristina Savin

Pith reviewed 2026-05-25 02:04 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · q-bio.NC · stat.ML

keywords online learning · recurrent neural networks · unified framework · biologically plausible · performance clustering · synthetic tasks · gradient alignment · stochastic algorithms

The pith

A four-criteria framework unifies recent online RNN training algorithms and accounts for how their performance clusters on two synthetic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper offers a compact way to organize many recent algorithms for training recurrent neural networks online in efficient or biologically plausible ways. Algorithms are sorted by whether they face the past or future, their tensor structure, whether they are stochastic or deterministic, and whether they use closed-form or numerical solutions. These groupings expose hidden links between different methods and supply mathematical reasons for why some succeed more than others. Tests on two synthetic tasks confirm that algorithm results group together according to the criteria, though alignment with exact gradients does not fully predict final performance, especially among stochastic methods.

Core claim

The framework classifies online RNN training algorithms by four criteria: past vs. future facing, tensor structure, stochastic vs. deterministic, and closed form vs. numerical. This classification reveals latent conceptual connections among recent advances, supplies novel mathematical intuitions for their degrees of success, and shows that performances cluster according to the criteria on two synthetic tasks. The paper also notes that gradient alignment with exact methods does not alone explain ultimate performance, particularly for stochastic algorithms, and calls for better comparison metrics.

What carries the argument

The four classification criteria (past vs. future facing, tensor structure, stochastic vs. deterministic, closed form vs. numerical) used to organize algorithms and explain performance clusters.
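As a minimal sketch of how such a classification could be represented, the snippet below stores one value per axis and groups algorithms that share all four. The class name, field names, tensor-structure labels, and the `algo_*` entries are hypothetical placeholders chosen for illustration; they are not the paper's assignments for any real algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    """One algorithm's position on the four axes (hypothetical encoding)."""
    facing: str        # "past" or "future"
    tensor: str        # free-text label for the tensor structure
    stochastic: bool   # True = stochastic, False = deterministic
    closed_form: bool  # True = closed form, False = numerical


def group_by_criteria(assignments):
    """Map each distinct Criteria value to the sorted names that share it."""
    groups = defaultdict(list)
    for name, criteria in assignments.items():
        groups[criteria].append(name)
    return {c: sorted(names) for c, names in groups.items()}


# Hypothetical placeholders, NOT the paper's classification of real algorithms.
EXAMPLE = {
    "algo_A": Criteria("past", "rank-1", True, True),
    "algo_B": Criteria("past", "rank-1", True, True),
    "algo_C": Criteria("future", "full", False, False),
}

groups = group_by_criteria(EXAMPLE)
for criteria, names in groups.items():
    print(criteria, names)
```

A frozen dataclass is hashable, so a full four-axis assignment can serve directly as a dictionary key when checking whether performance clusters line up with the groups.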

If this is right

Algorithms sharing the same values on the four criteria will tend to achieve similar performance levels on the tested tasks.
Gradient alignment with exact methods produces a similar clustering pattern but fails to account for final performance differences, especially in stochastic cases.
Better comparison metrics beyond gradient alignment are needed to evaluate stochastic online learning algorithms.
The framework allows recent advances in online RNN training to be summarized compactly through shared conceptual connections.
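For readers who want a concrete handle on "gradient alignment", here is a minimal sketch that assumes normalized alignment means the cosine of the angle between an approximate gradient and an exact one, both flattened into vectors. The function names and the zero-gradient convention are assumptions made for this illustration.

```python
import math


def normalized_alignment(g_approx, g_exact):
    """Cosine of the angle between two flattened gradient vectors."""
    if len(g_approx) != len(g_exact):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_approx, g_exact))
    norm_a = math.sqrt(sum(a * a for a in g_approx))
    norm_b = math.sqrt(sum(b * b for b in g_exact))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for a vanishing gradient
    return dot / (norm_a * norm_b)


def mean_alignment(pairs):
    """Average alignment over (approximate, exact) gradient pairs."""
    return sum(normalized_alignment(a, e) for a, e in pairs) / len(pairs)


print(normalized_alignment([1.0, 0.0], [1.0, 0.0]))   # same direction
print(normalized_alignment([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(normalized_alignment([1.0, 0.0], [-2.0, 0.0]))  # opposite direction
```

Because the measure discards gradient magnitude, two algorithms can have the same mean alignment and still differ in update size or variance, which is one way alignment and final performance could come apart.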

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers could deliberately combine criteria from high-performing algorithms to create new hybrids with targeted properties.
The emphasis on future-facing and deterministic methods may connect to why certain biologically inspired rules succeed in practice.
Applying the same axes to non-synthetic data could expose whether task structure interacts with the criteria to change clustering patterns.

Load-bearing premise

The four criteria are the primary cause of the observed performance clustering on the two synthetic tasks rather than task-specific details or other unaccounted variables in the setup.

What would settle it

Re-running the same set of algorithms on additional tasks while varying only the criteria assignments and holding other factors fixed, then checking whether the performance clusters break apart or reform along different lines.

Figures

Figures reproduced from arXiv: 1907.02649 by Cristina Savin, Kyunghyun Cho, Owen Marschall.

**Figure 2.** A visualization of the influence matrix and its 3 indices.

**Figure 3.** A visualization of various exact gradient methods. Each plot contains a lattice of…

**Figure 4.** a) Cross-entropy loss for networks trained on Add task with α = 1 for various algorithms. Lines are means over 20 random seeds (weight initialization and training set generation), and shaded regions represent ±1 S.E.M. Raw loss curves are first down-sampled by a factor of 10^-4 (rectangular kernel) and then smoothed with a 10-time-step windowed running average. b) Same for α = 0.5. For each task, we conside…

**Figure 5.** Same as Fig. 4, for Mimic task with mean-squared-error loss.

**Figure 6.** a) Histograms of normalized gradient alignments for each pair of algorithms. Gradients are calculated during a simulation of 100k time steps of the Add task (same hyperparameters as in Fig. 4). Learning follows RTRL gradients, with other algorithms’ gradients passively computed for comparison. Mean alignment (dashed blue line) and 0 alignment reference (dashed black line) shown. b) Mean alignments of each…

**Figure 7.** The joint distribution of normalized alignments and gradient norms (log scale).

**Figure 8.** RFLO as a static multi-layer regression.

**Figure 9.** Cartoon illustrating how alignment with RTRL and performance might dissociate.
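The loss-curve post-processing described in the Figure 4 caption (rectangular-kernel down-sampling, a windowed running average, and mean ± 1 S.E.M. across seeds) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the function names, the non-overlapping block convention, the trailing window, and the use of the sample variance are all choices made here.

```python
import math


def downsample(curve, block):
    """Rectangular-kernel down-sampling: mean of consecutive non-overlapping blocks."""
    n = len(curve) // block
    return [sum(curve[i * block:(i + 1) * block]) / block for i in range(n)]


def running_average(curve, window):
    """Trailing running mean; early points average over what is available."""
    out = []
    for i in range(len(curve)):
        chunk = curve[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def mean_and_sem(curves):
    """Pointwise mean and standard error of the mean across seeds (needs >= 2 seeds)."""
    n = len(curves)
    means, sems = [], []
    for values in zip(*curves):
        m = sum(values) / n
        var = sum((v - m) ** 2 for v in values) / (n - 1)  # sample variance
        means.append(m)
        sems.append(math.sqrt(var / n))
    return means, sems


# Two toy "seeds" with made-up loss values, for illustration only.
seed_curves = [[1.0, 3.0, 5.0, 7.0], [3.0, 5.0, 7.0, 9.0]]
processed = [running_average(downsample(c, 2), 2) for c in seed_curves]
means, sems = mean_and_sem(processed)
print(means, sems)
```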

Original abstract

We present a framework for compactly summarizing many recent results in efficient and/or biologically plausible online training of recurrent neural networks (RNN). The framework organizes algorithms according to several criteria: (a) past vs. future facing, (b) tensor structure, (c) stochastic vs. deterministic, and (d) closed form vs. numerical. These axes reveal latent conceptual connections among several recent advances in online learning. Furthermore, we provide novel mathematical intuitions for their degree of success. Testing various algorithms on two synthetic tasks shows that performances cluster according to our criteria. Although a similar clustering is also observed for gradient alignment, alignment with exact methods does not alone explain ultimate performance, especially for stochastic algorithms. This suggests the need for better comparison metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a coherent four-axis classification for online RNN training methods and notes that gradient alignment falls short as an explanation, but the clustering evidence from two tasks does not establish that the axes drive the results.

The core contribution is a classification scheme that groups online RNN training algorithms along four axes: past versus future facing, tensor structure, stochastic versus deterministic, and closed form versus numerical. The authors use it to connect several recent methods, supply some mathematical intuitions about their relative success, and report that algorithm performances cluster by these criteria on two synthetic tasks. They also observe that alignment with exact gradients does not fully account for performance differences, especially among stochastic algorithms, and conclude that better comparison metrics are needed.

This organizational lens and the point about gradient alignment are new relative to the cited prior work. The framework does pull disparate approaches together in a compact way and flags a real limitation in how these methods are usually evaluated.

The soft spot is the empirical support. The clustering is observed, but with only two tasks and no ablations that hold other factors fixed while varying one axis, it is not clear that the groupings are caused by the proposed criteria rather than by task-specific details. The stress-test concern lands: the evidence stays correlational, and the paper itself shows that another variable (gradient alignment) produces similar clusters without explaining outcomes.

This is a paper for researchers already working on efficient or biologically plausible online RNN training. A reader in that niche can extract the connections and the call for better metrics. It shows clear thinking on its own terms and engages the literature honestly, so it deserves a serious referee. I would send it to review and ask the authors to add targeted ablations or more tasks to test whether the axes are the operative variables.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a unified framework for online learning algorithms used in training recurrent neural networks (RNNs). Algorithms are organized according to four criteria: (a) past vs. future facing, (b) tensor structure, (c) stochastic vs. deterministic, and (d) closed form vs. numerical. The framework is claimed to reveal latent conceptual connections among recent advances, and the authors provide novel mathematical intuitions for algorithm success. Experiments on two synthetic tasks show that performances cluster according to the criteria; a similar clustering occurs for gradient alignment, but alignment does not explain ultimate performance, especially for stochastic methods, suggesting the need for better comparison metrics.

Significance. If the classification axes prove robust, the framework offers a compact way to summarize and connect results on efficient and biologically plausible online RNN training, which could guide algorithm design. The explicit observation that gradient alignment fails to explain performance (particularly for stochastic algorithms) is a credit to the work, as it identifies a gap and calls for improved metrics. The organizational approach itself, independent of the empirical claims, has value for the field even if the clustering evidence requires strengthening.

major comments (2)

[Experimental evaluation] Experimental evaluation section: the claim that performances 'cluster according to our criteria' rests on results from only two synthetic tasks. No ablation studies are described that vary one axis (e.g., stochastic vs. deterministic) while holding the others fixed, nor are statistical tests or variance estimates across runs reported. This leaves open the possibility that observed groupings reflect task artifacts (sequence length, noise, loss surface) rather than the four criteria, consistent with the stress-test concern.
[Framework and results discussion] Framework and results discussion: while the paper correctly notes that gradient alignment produces similar clusters yet does not explain performance for stochastic methods, it does not quantify how much of the performance variance is captured by each of the four axes versus unmeasured confounders. A regression or variance decomposition relating the axes to observed performance would make the explanatory claim more load-bearing.
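One simple form the requested variance decomposition could take is eta-squared: the share of total variance in final performance that lies between the groups defined by a single axis. The sketch below is an assumption about what such an analysis might look like, not something reported in the paper; the function name and the loss values are invented for illustration.

```python
def eta_squared(groups):
    """Between-group sum of squares divided by total sum of squares."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0.0:
        return 0.0  # no variance at all, so nothing to explain
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical final losses, invented for illustration only.
by_axis = {"deterministic": [0.25, 0.75], "stochastic": [1.25, 1.75]}
print(eta_squared(by_axis))
```

A value near 1 would mean the axis separates the algorithms almost completely on that task, and a value near 0 would mean it separates them hardly at all. With only a handful of algorithms per group the estimate is noisy, which is the statistical-power limitation the authors raise in their rebuttal.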

minor comments (2)

A summary table explicitly mapping each discussed algorithm to the four criteria would improve readability and allow readers to verify the classification.
[Abstract] The abstract states that 'novel mathematical intuitions' are provided but does not give even a one-sentence example; adding a brief illustration would help readers assess the contribution without reading the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for minor revision. We address the major comments point by point below.

Referee: [Experimental evaluation] Experimental evaluation section: the claim that performances 'cluster according to our criteria' rests on results from only two synthetic tasks. No ablation studies are described that vary one axis (e.g., stochastic vs. deterministic) while holding the others fixed, nor are statistical tests or variance estimates across runs reported. This leaves open the possibility that observed groupings reflect task artifacts (sequence length, noise, loss surface) rather than the four criteria, consistent with the stress-test concern.

Authors: We agree that expanding the experimental evaluation would strengthen the manuscript. The two synthetic tasks were chosen as standard benchmarks in the field for evaluating online RNN training algorithms, allowing direct comparison with prior work. While we did not perform explicit ablations varying one axis at a time, the framework is designed such that algorithms differ along these axes, and the clustering is observed consistently. In the revised version, we will report variance estimates from multiple independent runs and include statistical tests (e.g., ANOVA) to assess the significance of performance differences between clusters. We will also add a supplementary analysis highlighting the contribution of each axis by comparing subsets of algorithms. revision: partial
Referee: [Framework and results discussion] Framework and results discussion: while the paper correctly notes that gradient alignment produces similar clusters yet does not explain performance for stochastic methods, it does not quantify how much of the performance variance is captured by each of the four axes versus unmeasured confounders. A regression or variance decomposition relating the axes to observed performance would make the explanatory claim more load-bearing.

Authors: We acknowledge the value of a quantitative analysis. However, given the small number of algorithms tested (approximately 10-15 across the two tasks), a full regression or variance decomposition would have limited statistical power and would risk overfitting. The primary contribution is the observation that clustering occurs along the proposed axes and that gradient alignment alone is insufficient, particularly for stochastic methods. In revision, we will include a more detailed discussion with pairwise performance comparisons that isolate the effect of each axis where possible, and note the limitations of the current analysis. revision: partial
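The pairwise comparisons mentioned in this response presuppose pairs of algorithms that differ on exactly one axis. A minimal sketch of how such pairs could be enumerated is given below; the axis keys, tensor-structure labels, and `algo_*` entries are hypothetical placeholders rather than the paper's classification.

```python
from itertools import combinations

AXES = ("facing", "tensor", "stochastic", "closed_form")


def single_axis_pairs(assignments):
    """Return (axis, name_a, name_b) for pairs that differ on exactly one axis."""
    result = []
    for name_a, name_b in combinations(sorted(assignments), 2):
        a, b = assignments[name_a], assignments[name_b]
        differing = [axis for axis in AXES if a[axis] != b[axis]]
        if len(differing) == 1:
            result.append((differing[0], name_a, name_b))
    return result


# Hypothetical placeholders, NOT the paper's classification of real algorithms.
EXAMPLE = {
    "algo_A": {"facing": "past", "tensor": "rank-1", "stochastic": True, "closed_form": True},
    "algo_B": {"facing": "past", "tensor": "rank-1", "stochastic": False, "closed_form": True},
    "algo_C": {"facing": "future", "tensor": "full", "stochastic": False, "closed_form": False},
}

pairs = single_axis_pairs(EXAMPLE)
print(pairs)
```

If few or no such pairs exist among the tested algorithms, the effect of an individual axis cannot be separated from the others using the existing runs, which is the confound the referee points to.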

Circularity Check

0 steps flagged

No circularity: classification framework and empirical observations are independent of inputs

The paper defines a taxonomy of RNN online learning algorithms using four observable axes (past/future-facing, tensor structure, stochastic/deterministic, closed-form/numerical) drawn from the algorithms' explicit update rules. It then reports that performance on two synthetic tasks clusters along these axes, while noting that gradient alignment alone does not explain outcomes. Neither the taxonomy nor the clustering claim reduces to a fitted parameter, self-definition, or self-citation chain; the axes are descriptive properties external to the performance data, and the empirical result is an observation rather than a constructed prediction. No load-bearing self-citations, ansatzes, or renamings of known results appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an organizational framework applied to existing algorithms and does not introduce new fitted parameters, unproved axioms, or postulated entities.
