Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

Ali H. Sayed; Malek Khammassi; Michael Muehlebach; Onno Eberhard; Yike Zhao

arxiv: 2605.31261 · v1 · pith:ML6GFPHTnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI· stat.ML

Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

Yike Zhao , Onno Eberhard , Malek Khammassi , Ali H. Sayed , Michael Muehlebach This is my paper

Pith reviewed 2026-06-28 23:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords linear recurrent networkspartially observable reinforcement learninghidden Markov modelsbelief statessufficient statisticsPOMDPsrecurrent memory

0 comments

The pith

A linear recurrent filter exactly reproduces the belief vector logits of a hidden Markov model when transitions are deterministic, serving as a sufficient statistic for optimal policy learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs two linear filters to explain why linear recurrent networks perform well in partially observable reinforcement learning. The first filter matches the pre-softmax logits of the belief vector in an HMM with deterministic transitions, making it sufficient for optimal policies. The second achieves near-zero decoding error when transitions are nearly deterministic. These results hold for action-controlled HMMs with time-varying filters. Experiments confirm the filter acts as a strong feature extractor in a small RL game.

Core claim

The first linear filter exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning. The second linear filter achieves vanishing state-decoding error under a nearly deterministic transition matrix. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics.

What carries the argument

Two linear filters, the first reproducing pre-softmax belief logits exactly under deterministic transitions and the second driving state-decoding error to zero under near-determinism; both become time-varying and action-dependent in controlled settings.

If this is right

Optimal policies can be learned directly from the linear filter output without separate belief-state computation.
The filter output is a sufficient statistic, so no decision-relevant information is lost under the determinism condition.
The same linear construction applies to action-controlled HMMs by making the filter dynamics depend on the chosen action.
Linear recurrent units can serve as feature extractors that support policy learning in small POMDP environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The linear structure may continue to provide useful approximations even when transitions deviate moderately from determinism.
Trained linear recurrent networks on POMDP tasks might implicitly learn filters close to these constructions.
One could measure how quickly decoding error grows as transition stochasticity increases to map the boundary of the exact regime.

Load-bearing premise

The hidden Markov model's transition matrix must be deterministic or nearly deterministic for exact reproduction or vanishing error to hold.

What would settle it

Apply the linear filter to an HMM whose transition matrix has substantial stochasticity and check whether the output still matches the belief logits exactly or the decoding error stays near zero; sustained mismatch or error would disprove the claim.

Figures

Figures reproduced from arXiv: 2605.31261 by Ali H. Sayed, Malek Khammassi, Michael Muehlebach, Onno Eberhard, Yike Zhao.

**Figure 1.** Figure 1: Long-run error probability versus 1/ε for ALF and LOF. Solid/dashed lines correspond to valid/invalid δε selections, respectively. δε = ε 1/2 , δε = 0.7 log(1/ε) . (48) The latter choice also satisfies (34); the detailed derivations are provided in Appendix D.2. In addition to valid δε choices, we also consider three invalid ones: δε = ε 2 , δε = 0, δε = 1, (49) where δε = ε 2 and δε = 0 violate (32), and… view at source ↗

**Figure 3.** Figure 3: Episodic return versus update. The shaded area is the standard deviation across 8 random seeds. ALF/LOF are constructed from environment knowledge, whereas S5 and Deep ALF are trained from scratch. the more likely it will be observed. The states 6, 11, and 12 are “trap” states, and the states 2 and 3 are goal states. Entering a trap state incurs a penalty, while entering a goal state yields a reward. An … view at source ↗

**Figure 4.** Figure 4: Illustration of a difficult case for generalization. A possible generalization to nearly-deterministic POMDPs is to assume backbone deterministic matrices D(a) = P(a) D12(a) 0 D22(a) , (53) with strictly upper-triangular D22(a) for all a ∈ A, so that transient-state probabilities go to zero after finitely many steps. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Sample evolution of the logit difference wk(x2) − wk(x1) for LOF and ALF under ε = 0.005. B. Visualization of ALF’s State Tracking In this section, we provide additional visualizations to further illustrate how ALF recovers the underlying states. B.1. Time-Invariant ALF In this section, we consider a two-state HMM: T = 1 − ε ε ε 1 − ε , E = 0.8 0.2 0.2 0.8 , (54) with ε = 0.005. In this setting, th… view at source ↗

**Figure 6.** Figure 6: Sample trajectory of true states, estimated states, actions, and observations in the RingWorld environment. C. Proofs for Section 4 C.1. Proof of Lemma 4.2 If T is a permutation matrix, there are no transient states (i.e., r = 0), and Lemma 4.2 holds trivially as T11 and T21 are empty matrices. Otherwise, T must have at least two identical columns, which implies the existence of at least one transient stat… view at source ↗

**Figure 7.** Figure 7: Per-time-step error probability for ALF and LOF under ε = 0.005. Then we prove by induction. For k = 0, the statement is ∥we0∥∞ ≤ max{∥we0∥∞, AE}, which is trivially true. Assume the hypothesis is true for k − 1, i.e., ∥we k−1∥∞ ≤ max{∥we0∥∞, AE}. From (191), we have: ∥we k∥∞ ≤ (1 − δ)∥we k−1∥∞ + δAE ≤ (1 − δ) max{∥we0∥∞, AE} + δ max{∥we0∥∞, AE} = max{∥we0∥∞, AE}. (192) Thus, the hypothesis holds for k. By… view at source ↗

**Figure 8.** Figure 8: Observation probabilities P(y = ui | x ⋆ ) for i = 1, . . . , 4 versus x ⋆ ∈ {1, . . . , 12}, computed from (195). and is further illustrated in [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗

**Figure 9.** Figure 9: Episodic return versus update. (a) shows the full results, and (b) shows a zoomed-in view of the last 100 updates. The shaded area indicates the standard deviation across 8 random seeds. • Random δ: Λ(·), E, and V are initialized with their true values, while δ is drawn uniformly from (0, 0.5). This corresponds to an unknown perturbation pattern εQ, since the optimal δ should depend on the specific εQ. • R… view at source ↗

**Figure 10.** Figure 10: Episodic return versus update. (a) shows the full results, and (b) shows a zoomed-in view of the last 100 updates. The shaded area indicates the standard deviation across 8 random seeds. achieves noticeably better final performance, as shown in Figure 10a. Only by increasing S5’s hidden size to 256 (with substantially more parameters) does its performance approach that of Deep ALF. Furthermore, Deep ALF o… view at source ↗

read the original abstract

The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning, (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Linear filters match belief logits exactly only under deterministic transitions, giving a scoped justification for their use in POMDPs.

read the letter

The punchline is that linear recurrent memory exactly matches the belief logits in an HMM only when transitions are deterministic, which explains its success in those specific POMDPs but leaves the general case open.

The paper constructs two linear filters. The first one reproduces the pre-softmax logits of the belief vector exactly under a deterministic transition matrix, serving as a sufficient statistic for optimal policy learning. The second achieves vanishing decoding error when the matrix is nearly deterministic. These results carry over to action-controlled HMMs, with the filters becoming time-varying and action-dependent. They back this up with numerical experiments and show the filter working well as a feature extractor in a small reinforcement learning game.

This is a solid incremental step. It makes an explicit connection between linear recurrences and belief tracking that was missing in the RL literature, even if similar ideas exist in control theory. The construction is precise and the extension to controlled cases is useful.

The soft spot is the load-bearing assumption of determinism. Without it, the belief update involves nonlinear operations from the sum over possible previous states and the normalization constant, so a linear filter cannot match exactly. The paper handles this by focusing on the deterministic and near-deterministic regimes, but that means the justification does not apply to most real POMDPs where transitions have significant stochasticity. The experiments are in a small game, which is fine for illustration but does not provide strong evidence for broader use.

This paper is for people studying memory mechanisms in partially observable reinforcement learning. It shows clear thinking on the math side and honest scoping of the result.

I recommend sending it to peer review. The central construction is worth referee attention even with the scope limitation.

Referee Report

2 major / 2 minor

Summary. The manuscript constructs two linear filters for hidden Markov models (HMMs) to explain the effectiveness of linear recurrent networks as memory in partially observable RL. The first filter exactly reproduces the pre-softmax logits of the belief vector under a deterministic transition matrix, serving as a sufficient statistic for optimal policy learning. The second achieves vanishing state-decoding error under a nearly deterministic transition matrix. Results extend to action-controlled HMMs (with time-varying, action-dependent filters) and are illustrated via numerical experiments plus a small RL game demonstrating the filter as a feature extractor.

Significance. If the constructions and proofs hold, the work supplies a concrete theoretical account of why linear recurrences can function as (near-)exact belief-state representations in POMDPs, under clearly stated conditions on the transition dynamics. This is a positive contribution that links architectural choices directly to HMM filtering properties and could guide memory-module design; the numerical validation in an RL setting adds empirical grounding.

major comments (2)

[§3] §3 (first linear filter construction): the exact logit reproduction and sufficiency claim are derived only for deterministic transition matrices (the Bayes update reduces to a single predecessor term whose normalization cancels). The paper must state explicitly whether this premise is required for the central sufficiency result or whether the filter remains useful more broadly; without that clarification the title and introduction risk overstating generality for typical stochastic-transition POMDPs.
[§4] §4 (action-controlled extension): the time-varying filter is obtained by making the recurrence action-dependent, yet the sufficiency argument still rests on the same determinism premise. It is unclear whether the optimal-policy guarantee survives when the action also modulates the (still deterministic) transition; a short counter-example or additional derivation step would strengthen the claim.

minor comments (2)

Notation for the pre-softmax logits and the normalization constants in the belief update should be introduced once and used consistently; several equations reuse the same symbols for distinct quantities.
Figure 2 (numerical experiment) would benefit from an additional panel showing the decoding error as a function of the deviation from determinism, to quantify the “nearly deterministic” regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate clarifications into the revised manuscript.

read point-by-point responses

Referee: [§3] §3 (first linear filter construction): the exact logit reproduction and sufficiency claim are derived only for deterministic transition matrices (the Bayes update reduces to a single predecessor term whose normalization cancels). The paper must state explicitly whether this premise is required for the central sufficiency result or whether the filter remains useful more broadly; without that clarification the title and introduction risk overstating generality for typical stochastic-transition POMDPs.

Authors: We agree that the exact reproduction of pre-softmax belief logits and the resulting sufficiency for optimal policy learning are derived specifically under deterministic transition matrices, where the Bayes update collapses to a single predecessor. The manuscript already states this assumption in the abstract and Section 3, but we will add an explicit sentence in both the introduction and Section 3 clarifying that the central sufficiency result requires deterministic transitions and that the second filter addresses the nearly-deterministic regime. This revision will remove any risk of overstating generality. revision: yes
Referee: [§4] §4 (action-controlled extension): the time-varying filter is obtained by making the recurrence action-dependent, yet the sufficiency argument still rests on the same determinism premise. It is unclear whether the optimal-policy guarantee survives when the action also modulates the (still deterministic) transition; a short counter-example or additional derivation step would strengthen the claim.

Authors: Under action-controlled deterministic transitions the same single-predecessor argument applies, with the transition matrix now selected by the current action; the linear filter therefore continues to reproduce the exact belief logits and remains a sufficient statistic. We will insert a short additional derivation paragraph in Section 4 that writes out the action-dependent Bayes update and confirms the sufficiency claim carries over directly. revision: yes

Circularity Check

0 steps flagged

No circularity; explicit construction under stated deterministic-transition assumption

full rationale

The paper constructs linear filters mathematically to match HMM belief logits exactly when the transition matrix is deterministic, and to achieve vanishing error when nearly deterministic. This is presented as a direct derivation from the model equations rather than a fit to data or a redefinition of the target quantity. No load-bearing self-citations, fitted-input predictions, or ansatzes smuggled via prior work are indicated in the abstract or context; the central claim is an explicit equivalence under an openly stated premise, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard HMM formulation with deterministic or nearly deterministic transition matrices; no free parameters, invented entities, or ad-hoc axioms are mentioned in the abstract.

axioms (1)

domain assumption The environment is modeled as a hidden Markov model with deterministic (or nearly deterministic) transition matrix.
This is the key modeling choice that enables the exact or vanishing-error properties of the linear filters.

pith-pipeline@v0.9.1-grok · 5677 in / 1244 out tokens · 18833 ms · 2026-06-28T23:25:21.486118+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 5 canonical work pages · 3 internal anchors

[1]

K., Hutter, F., and Pontil, M

Grazzi, R., Siems, J., Zela, A., Franke, J. K., Hutter, F., and Pontil, M. Unlocking state-tracking in linear RNNs through negative eigenvalues.arXiv:2411.12537,

work page arXiv
[2]

Efficiently Modeling Long Sequences with Structured State Spaces

9 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning Gu, A., Goel, K., and R´e, C. Efficiently modeling long se- quences with structured state spaces.arXiv:2111.00396,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

POPGym: Benchmarking partially observable reinforcement learning.arXiv:2303.01859,

Morad, S., Kortvelesy, R., Bettini, M., Liwicki, S., and Prorok, A. POPGym: Benchmarking partially observable reinforcement learning.arXiv:2303.01859,

work page arXiv
[4]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Simplified State Space Layers for Sequence Modeling

Smith, J. T., Warrington, A., and Linderman, S. W. Simplified state space layers for sequence modeling. arXiv:2208.04933,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Additional Discussion A.1

10 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning A. Additional Discussion A.1. Observability Assumption When r= 0 and P=I , we have eE=E and eEs reduces to E since M= 1 . In this case, Assumptions 5.6a and 5.6b coincide with those of Khammassi et al. (2025, Assumptions 1 and 3). For r≥1 or M >1 , however, they are weaker...

2025
[7]

Nevertheless,ALFstill achieves good state decoding

First, although the overall trends are similar, there is a clear discrepancy between the dynamics ofALFandLOF. Nevertheless,ALFstill achieves good state decoding. Second, since δ1 < δ 2, the dynamics under δ1 appear smoother than those under δ2, as the smaller value reduces sensitivity to instantaneous (and noisy) observations. However, this smoother beha...

2023
[8]

Lemma D.3.Consider a nearly-deterministic transition matrixTunder Assumptions 5.3 and 5.6

Now, if we set α≥α 1(x) = log( X n∈Γ(x) 1 M dn(x) + 2W)−log(− 1 2 Z − 1 M 0 X n∈Γ(x) Λn(τ;x) τ dτ),(122) and substitute it back into (110), we obtain: lim ε→0 Aε(x)≤ 1 2 Z − 1 M 0 X n∈Γ(x) Λn(τ;x) τ dτ <0.(123) 22 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning Now lettingα≥max x∈X \{x1} α1(x), we conclude the proof: lim ε...

1996
[9]

= · · O(ε) D2 22 +O(ε) .(153) Similarly, forT ℓ we can get: T ℓ = · · O(ε) Dℓ 22 +O(ε) (a) = · · O(ε) O(ε) .(154) where(a)follows from the fact thatD 22 is a nilpotent matrix with indexℓ. Fork≥ℓ, we always have: T k = · · O(ε) O(ε) .(155) Therefore, in conclusion, fork≥ℓand∀x∈ X \ eX, we have: P[x ⋆ k =x] =π k(x) = (T) kπ0 x =O(ε).(156) Lemma D.5.Under a ...

2016
[10]

The total episodic reward therefore lies in PK k=1 r(x⋆ k)∈[−1,1]

The reward function is defined as r(x⋆ k) = 1 K 1{x ⋆ k is a goal state} −1{x ⋆ k is a trap state} ,(196) whereKis the episode length. The total episodic reward therefore lies in PK k=1 r(x⋆ k)∈[−1,1]. Our PPO implementations, including the encoder, S5, actor, and critic, follow Lu et al. (2023). The general hyperparameters used forLOF,ALF, andS5-based me...

2023
[11]

33 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning E.2.2. ALF-INSPIREDDEEPLEARNINGMEMORY In our experiment, all permutation matricesP(CW1) , P(CW2) , P(CCW1) , and P(CCW2) are circulant, meaning each row is a cyclic right shift of the previous one. According to Gray (2006, Theorem 3.1), any circulant matrix can be diagonal...

2006
[12]

Next, we keepΛ(·),E,V, andδlearnable, but vary their initialization schemes: • Random E: Λ(·) and V are initialized with their true values, and δ= 0.1

From Figure 9, we observe that making all parameters learnable allowsDeep ALFto slightly outperformALF, reaching performance levels close to the softmax-processedLOF. Next, we keepΛ(·),E,V, andδlearnable, but vary their initialization schemes: • Random E: Λ(·) and V are initialized with their true values, and δ= 0.1 . For E, we first draw κ∈R S×N from a s...

2048

[1] [1]

K., Hutter, F., and Pontil, M

Grazzi, R., Siems, J., Zela, A., Franke, J. K., Hutter, F., and Pontil, M. Unlocking state-tracking in linear RNNs through negative eigenvalues.arXiv:2411.12537,

work page arXiv

[2] [2]

Efficiently Modeling Long Sequences with Structured State Spaces

9 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning Gu, A., Goel, K., and R´e, C. Efficiently modeling long se- quences with structured state spaces.arXiv:2111.00396,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

POPGym: Benchmarking partially observable reinforcement learning.arXiv:2303.01859,

Morad, S., Kortvelesy, R., Bettini, M., Liwicki, S., and Prorok, A. POPGym: Benchmarking partially observable reinforcement learning.arXiv:2303.01859,

work page arXiv

[4] [4]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Simplified State Space Layers for Sequence Modeling

Smith, J. T., Warrington, A., and Linderman, S. W. Simplified state space layers for sequence modeling. arXiv:2208.04933,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Additional Discussion A.1

10 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning A. Additional Discussion A.1. Observability Assumption When r= 0 and P=I , we have eE=E and eEs reduces to E since M= 1 . In this case, Assumptions 5.6a and 5.6b coincide with those of Khammassi et al. (2025, Assumptions 1 and 3). For r≥1 or M >1 , however, they are weaker...

2025

[7] [7]

Nevertheless,ALFstill achieves good state decoding

First, although the overall trends are similar, there is a clear discrepancy between the dynamics ofALFandLOF. Nevertheless,ALFstill achieves good state decoding. Second, since δ1 < δ 2, the dynamics under δ1 appear smoother than those under δ2, as the smaller value reduces sensitivity to instantaneous (and noisy) observations. However, this smoother beha...

2023

[8] [8]

Lemma D.3.Consider a nearly-deterministic transition matrixTunder Assumptions 5.3 and 5.6

Now, if we set α≥α 1(x) = log( X n∈Γ(x) 1 M dn(x) + 2W)−log(− 1 2 Z − 1 M 0 X n∈Γ(x) Λn(τ;x) τ dτ),(122) and substitute it back into (110), we obtain: lim ε→0 Aε(x)≤ 1 2 Z − 1 M 0 X n∈Γ(x) Λn(τ;x) τ dτ <0.(123) 22 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning Now lettingα≥max x∈X \{x1} α1(x), we conclude the proof: lim ε...

1996

[9] [9]

= · · O(ε) D2 22 +O(ε) .(153) Similarly, forT ℓ we can get: T ℓ = · · O(ε) Dℓ 22 +O(ε) (a) = · · O(ε) O(ε) .(154) where(a)follows from the fact thatD 22 is a nilpotent matrix with indexℓ. Fork≥ℓ, we always have: T k = · · O(ε) O(ε) .(155) Therefore, in conclusion, fork≥ℓand∀x∈ X \ eX, we have: P[x ⋆ k =x] =π k(x) = (T) kπ0 x =O(ε).(156) Lemma D.5.Under a ...

2016

[10] [10]

The total episodic reward therefore lies in PK k=1 r(x⋆ k)∈[−1,1]

The reward function is defined as r(x⋆ k) = 1 K 1{x ⋆ k is a goal state} −1{x ⋆ k is a trap state} ,(196) whereKis the episode length. The total episodic reward therefore lies in PK k=1 r(x⋆ k)∈[−1,1]. Our PPO implementations, including the encoder, S5, actor, and critic, follow Lu et al. (2023). The general hyperparameters used forLOF,ALF, andS5-based me...

2023

[11] [11]

33 Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning E.2.2. ALF-INSPIREDDEEPLEARNINGMEMORY In our experiment, all permutation matricesP(CW1) , P(CW2) , P(CCW1) , and P(CCW2) are circulant, meaning each row is a cyclic right shift of the previous one. According to Gray (2006, Theorem 3.1), any circulant matrix can be diagonal...

2006

[12] [12]

Next, we keepΛ(·),E,V, andδlearnable, but vary their initialization schemes: • Random E: Λ(·) and V are initialized with their true values, and δ= 0.1

From Figure 9, we observe that making all parameters learnable allowsDeep ALFto slightly outperformALF, reaching performance levels close to the softmax-processedLOF. Next, we keepΛ(·),E,V, andδlearnable, but vary their initialization schemes: • Random E: Λ(·) and V are initialized with their true values, and δ= 0.1 . For E, we first draw κ∈R S×N from a s...

2048