Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning

Aryaman Reddi; Carlo D'Eramo; Jan Peters; Noah Farr

arxiv: 2605.24709 · v1 · pith:BNY7IO46new · submitted 2026-05-23 · 💻 cs.LG

Pith reviewed 2026-06-30 14:32 UTC · model grok-4.3

keywords: streaming reinforcement learning · partial observability · real-time recurrent learning · recurrent trace units · POMDP · online learning · diagonal recurrent networks

The pith

Recurrent trace units enable exact real-time recurrent learning for streaming RL in partially observable environments with linear complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Streaming reinforcement learning processes experiences one at a time without replay buffers. This works for fully observable tasks but has been blocked in partially observable settings, because truncated backpropagation through time collapses to a one-step gradient horizon and exact real-time recurrent learning (RTRL) is too costly. The paper shows that recurrent trace units (RTUs), a diagonal recurrent architecture, remove this barrier by supporting exact RTRL at time and memory cost linear in the parameter count. The units integrate directly into existing streaming algorithms for both discrete and continuous control. On a MemoryChain test with increasing chain lengths, the method maintains performance while streaming TBPTT(1) baselines built on feedforward, GRU, and RTU networks collapse. It is also competitive with batched PPO on POPGym tasks and recovers a substantial share of batched performance on masked MuJoCo without any replay buffer.
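To make the one-step collapse concrete, here is a minimal sketch on a scalar linear recurrence `h_t = a * h_{t-1} + w * x_t`. The recurrence, the function name `gradients_at_final_step`, and the cue-at-first-step input are illustrative assumptions, not the paper's setup. The sketch compares the exact sensitivity of the final state to `w` with what a one-step truncation would report.

```python
import numpy as np

def gradients_at_final_step(a, w, xs):
    """Return (exact, tbptt1) gradients of h_T w.r.t. w
    for the toy recurrence h_t = a * h_{t-1} + w * x_t."""
    h, trace = 0.0, 0.0
    for x in xs:
        trace = a * trace + x   # exact sensitivity dh_t/dw, carried forward in time
        h = a * h + w * x
    tbptt1 = xs[-1]             # one-step truncation treats h_{T-1} as a constant
    return trace, tbptt1

# The informative input appears only at the first step of an 8-step sequence.
xs = np.zeros(8)
xs[0] = 1.0
exact, truncated = gradients_at_final_step(a=0.9, w=0.5, xs=xs)
print(exact, truncated)
```

In this toy case the truncated gradient is zero because the last input is zero, while the carried trace still credits the first input, scaled by `a` raised to the number of elapsed steps.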

Core claim

Recurrent trace units are a diagonal recurrent architecture that enables exact RTRL with linear time and memory complexity in the parameter count, and they integrate cleanly into existing streaming algorithms across both discrete and continuous control under partial observability.

What carries the argument

recurrent trace units, a diagonal recurrent architecture that enables exact RTRL with linear time and memory complexity in the parameter count
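The linear-cost claim can be illustrated with a small NumPy sketch. It assumes a real-valued diagonal recurrence `h_t = tanh(lam * h_{t-1} + W @ x_t)`; the paper's actual RTU parameterization is not given in this summary, so the model, the names `run`, `lam`, `W`, and the loss `sum(h_T)` are all hypothetical. Because each parameter influences exactly one hidden unit, the RTRL sensitivity has one entry per parameter and can be updated elementwise.

```python
import numpy as np

def run(lam, W, xs):
    """Run h_t = tanh(lam * h_{t-1} + W @ x_t) and carry exact RTRL traces."""
    n, d = W.shape
    h = np.zeros(n)
    e_lam = np.zeros(n)        # e_lam[i]  = d h[i] / d lam[i]
    E_W = np.zeros((n, d))     # E_W[i, j] = d h[i] / d W[i, j]
    for x in xs:
        h_new = np.tanh(lam * h + W @ x)
        gate = 1.0 - h_new ** 2                       # derivative of tanh
        # Elementwise trace updates: cost is linear in the number of parameters.
        e_lam = gate * (h + lam * e_lam)
        E_W = gate[:, None] * (x[None, :] + lam[:, None] * E_W)
        h = h_new
    return h, e_lam, E_W

rng = np.random.default_rng(0)
n, d, T = 4, 3, 20
lam = rng.uniform(0.5, 0.95, size=n)
W = rng.normal(scale=0.5, size=(n, d))
xs = rng.normal(size=(T, d))

h, e_lam, E_W = run(lam, W, xs)
grad_lam = e_lam               # loss L = sum(h_T), so dL/dh_T is all ones

# Central finite-difference check of the gradient w.r.t. lam.
eps = 1e-6
fd_lam = np.zeros(n)
for i in range(n):
    lp, lm = lam.copy(), lam.copy()
    lp[i] += eps
    lm[i] -= eps
    fd_lam[i] = (run(lp, W, xs)[0].sum() - run(lm, W, xs)[0].sum()) / (2 * eps)

print(np.max(np.abs(grad_lam - fd_lam)))
```

The traces `e_lam` and `E_W` together hold exactly as many numbers as there are parameters, which is the sense in which memory is linear in the parameter count for this sketch. No history of past inputs or states is stored.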

If this is right

Performance holds on MemoryChain diagnostic tasks with chain lengths from 2 to 128 where streaming TBPTT(1) baselines collapse.
The streaming method is competitive with batched PPO on five POPGym tasks.
On partially observable MuJoCo continuous control the approach recovers a substantial fraction of batched performance without replay buffers or batched updates.
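For readers unfamiliar with memory diagnostics of this kind, the following is a hypothetical cue-recall environment in the spirit of a memory chain. The summary does not specify the paper's MemoryChain task, so the class `MemoryChainSketch`, its reward scheme, and the helper policies are assumptions made only for illustration. The cue is visible at the first step and must be recalled at the last one.

```python
import random

class MemoryChainSketch:
    """Hypothetical cue-recall task: the cue is shown only at the first step."""

    def __init__(self, length, seed=0):
        self.length = length
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.cue = self.rng.choice([0, 1])
        return 1.0 if self.cue else -1.0      # first observation encodes the cue

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        reward = 0.0
        if done:                               # only the final action is scored
            reward = 1.0 if action == self.cue else -1.0
        return 0.0, reward, done               # later observations are uninformative

def run_episode(env, policy):
    obs, memory, done, total = env.reset(), None, False, 0.0
    while not done:
        action, memory = policy(obs, memory)
        obs, reward, done = env.step(action)
        total += reward
    return total

def remembering_policy(obs, memory):
    """Stores the cue on the first call and repeats it afterwards."""
    if memory is None:
        memory = 1 if obs > 0 else 0
    return memory, memory

returns = [run_episode(MemoryChainSketch(length, seed=s), remembering_policy)
           for length in (2, 16, 128) for s in range(5)]
print(returns)
```

A policy that keeps the cue in memory earns the full reward at every chain length here, whereas a policy that only sees the current observation has nothing to condition on once the chain is longer than one step.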

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The linear complexity may open streaming RL to real-time agents that must act on partial observations without storing large buffers.
Diagonal structure could be tested on longer-horizon or higher-dimensional POMDPs to check where it ceases to suffice.
Integration with other online methods might extend the same linear RTRL benefit beyond the tested control settings.

Load-bearing premise

The diagonal recurrent architecture supplies enough expressivity to capture the temporal dependencies needed for the POMDP tasks without full recurrent connections.

What would settle it

A POMDP task whose solution requires non-diagonal temporal mixing where recurrent trace units fail to learn but a full recurrent network succeeds would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.24709 by Aryaman Reddi, Carlo D'Eramo, Jan Peters, Noah Farr.

**Figure 3.** IQM episodic return over 5 seeds with shaded standard error on four MuJoCo tasks under …

Original abstract

Streaming reinforcement learning has emerged as an online learning paradigm that conforms to the restrictions of natural learning agents that process data incrementally, i.e. with a batch size of 1 and no replay buffer. While streaming RL has recently been shown to scale with deep function approximation with full observability, partially observable settings have remained out of reach. Truncated backpropagation through time collapses to a one-step gradient horizon under the streaming setting, and exact real-time recurrent learning is prohibitively expensive. We close this gap using recurrent trace units, a diagonal recurrent architecture that enables exact RTRL with linear time and memory complexity in the parameter count, and show that they integrate cleanly into existing streaming algorithms across both discrete and continuous control. On a MemoryChain diagnostic with chain lengths from 2 to 128, our method sustains performance where streaming TBPTT(1) baselines using feedforward, GRU, and RTU networks collapse. On five POPGym tasks and on partially observable MuJoCo continuous control, the streaming approach is competitive with batched PPO on POPGym and recovers a substantial fraction of batched performance on masked MuJoCo, despite using no replay buffer or batched updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces diagonal recurrent trace units to make exact RTRL feasible in streaming POMDPs and shows it outperforming TBPTT(1) on MemoryChain while staying competitive on POPGym and masked MuJoCo.

The core contribution is a diagonal recurrent architecture called recurrent trace units that delivers exact real-time recurrent learning at time and memory cost linear in the parameter count. This lets the authors run streaming RL (batch size 1, no replay) on partially observable problems where truncated BPTT collapses to one-step updates and full RTRL was previously intractable.

They integrate the units into existing streaming algorithms and report clear wins on the MemoryChain diagnostic across chain lengths 2–128, plus competitive results against batched PPO on five POPGym tasks and partial recovery of batched performance on masked MuJoCo. The experiments directly compare against feedforward, GRU, and RTU baselines under the streaming constraint, which is the right test.

The main limitation is the strictly diagonal recurrence: hidden units evolve independently with no off-diagonal mixing. This works for the chosen benchmarks, but the paper does not show that the restriction is harmless for POMDPs whose belief states require nonlinear combination across memory channels. The empirical evidence is task-specific rather than supported by a capacity argument or counter-example.

The work is aimed at researchers building online agents that must learn incrementally under partial observability. Anyone already experimenting with streaming RL or RTRL will find the benchmarks and integration details useful. The central claim is concrete enough and the gap it targets is real, so it deserves a full referee process even if the diagonal design needs closer scrutiny on expressivity.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes recurrent trace units (RTUs), a diagonal recurrent neural network architecture that permits exact real-time recurrent learning (RTRL) with linear time and memory complexity in the number of parameters. The authors integrate RTUs into streaming RL algorithms and evaluate them on a MemoryChain diagnostic task with varying chain lengths, five tasks from the POPGym suite, and partially observable MuJoCo environments, claiming competitive performance with batched methods despite using batch size 1 and no replay buffer.

Significance. Should the central claims hold, the work would be significant for advancing streaming reinforcement learning to partially observable settings, where previous approaches either truncate gradients or incur prohibitive costs. The linear-complexity exact RTRL is a clear technical advance, and the empirical results on both diagnostic and standard benchmarks provide concrete evidence of the method's viability in discrete and continuous control. The absence of replay buffers aligns with the streaming paradigm and strengthens the contribution.

major comments (2)

§3 (Recurrent Trace Units definition): The recurrence matrix is strictly diagonal, so hidden units evolve independently with no direct mixing. No representational-capacity argument, proof, or counter-example is supplied showing this restriction is without loss of generality for POMDPs whose belief states require nonlinear cross-channel interactions; this assumption is load-bearing for the claim that RTUs suffice across the evaluated tasks.
§4.2 (POPGym and masked MuJoCo results): The claim that the streaming RTU approach 'recovers a substantial fraction' of batched PPO performance is central to the empirical contribution, yet the text supplies neither the exact recovered fractions, seed-wise standard deviations, nor an ablation contrasting diagonal RTU against a non-diagonal recurrent baseline, preventing isolation of the diagonal restriction's effect.

minor comments (2)

Abstract: the phrase 'linear time and memory complexity in the parameter count' would benefit from explicit big-O notation and a direct contrast to the quadratic cost of standard RTRL.
§5 (Discussion): a brief reference to prior diagonal or low-rank recurrent architectures would help situate the novelty of the RTU construction.
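As a point of reference for the first minor comment, the textbook per-step costs of RTRL can be written out as follows. The notation (n hidden units, p parameters) is introduced here for illustration and is not taken from the paper. For a fully connected recurrent layer the sensitivity matrix has one row per hidden unit and one column per parameter, while for a diagonal recurrence each parameter affects a single hidden unit, so one trace entry per parameter suffices.

```latex
\begin{align*}
\text{Fully connected, } n \text{ units, } p = \Theta(n^2):\quad
  & \text{memory } O(np) = O(p^{3/2}), && \text{time per step } O(n^2 p) = O(p^2) \\
\text{Diagonal recurrence:}\quad
  & \text{memory } O(p), && \text{time per step } O(p)
\end{align*}
```

Under these assumptions, the per-step time of standard RTRL grows quadratically in the parameter count, which is the contrast the comment asks the abstract to state.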

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: §3 (Recurrent Trace Units definition): The recurrence matrix is strictly diagonal, so hidden units evolve independently with no direct mixing. No representational-capacity argument, proof, or counter-example is supplied showing this restriction is without loss of generality for POMDPs whose belief states require nonlinear cross-channel interactions; this assumption is load-bearing for the claim that RTUs suffice across the evaluated tasks.

Authors: We agree that the manuscript does not supply a general proof or counter-example establishing that the diagonal restriction is without loss of generality. The diagonal form is deliberately chosen to obtain exact RTRL at linear complexity; we will revise §3 to state this motivation explicitly, acknowledge that direct nonlinear cross-channel mixing in the recurrence is precluded, and clarify that we make no universality claim. We will instead highlight that the combination of input projections, nonlinear activations, and per-unit traces enables effective belief-state tracking on the evaluated POMDPs, as demonstrated by the MemoryChain and POPGym results. A short limitations paragraph will be added noting that more complex belief states may require richer recurrence. revision: yes
Referee: §4.2 (POPGym and masked MuJoCo results): The claim that the streaming RTU approach 'recovers a substantial fraction' of batched PPO performance is central to the empirical contribution, yet the text supplies neither the exact recovered fractions, seed-wise standard deviations, nor an ablation contrasting diagonal RTU against a non-diagonal recurrent baseline, preventing isolation of the diagonal restriction's effect.

Authors: We accept that the current text is insufficiently quantitative. In the revised manuscript we will (i) report the precise average recovered fraction of batched PPO performance together with per-task values, (ii) include seed-wise standard deviations for all reported curves, and (iii) add an ablation on a representative subset of tasks that contrasts diagonal RTU against a non-diagonal recurrent baseline (where the latter remains computationally tractable). These additions will allow readers to assess the practical impact of the diagonal restriction more precisely. revision: yes

Circularity Check

0 steps flagged

No circularity: new diagonal architecture derived and tested independently on external benchmarks

full rationale

The paper introduces recurrent trace units as a novel diagonal recurrent architecture enabling exact RTRL with linear complexity, then evaluates the resulting streaming RL method on MemoryChain (chain lengths 2-128), five POPGym tasks, and partially observable MuJoCo. These are external benchmarks with no indication that performance metrics or architectural claims reduce by construction to fitted parameters, self-citations, or renamed inputs. The abstract and description present the method as a self-contained proposal whose correctness is assessed via independent empirical comparison to baselines like TBPTT(1), GRU, and batched PPO. No load-bearing derivation step is shown to collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract provides no explicit free parameters, axioms, or invented entities beyond the RTU architecture itself; standard RL assumptions such as Markovian transitions under partial observability are implicit but not detailed.

invented entities (1)

recurrent trace unit (no independent evidence)
purpose: Diagonal recurrent architecture enabling exact RTRL with linear time and memory complexity
New architecture introduced to solve the streaming POMDP RTRL problem; no independent evidence outside the paper's empirical results is provided in the abstract.
