A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets

Wanyun Cui

arxiv: 2607.02303 · v1 · pith:KB3NYFULnew · submitted 2026-07-02 · 💻 cs.AI

Pith reviewed 2026-07-03 13:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords: linear attention · hippocampal memory · exact KV cache · prediction residual · recurrent state · long context retrieval · language modeling · complementary learning systems

The pith

Linear attention gains an exact hippocampal cache that stores what its recurrent state overwrites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear attention compresses the full prefix into a fixed-size recurrent state, which keeps memory constant but overwrites earlier key-value associations when many compete. The paper proposes HOLA, which keeps the usual compressive state and adds a bounded exact KV cache. The cache selects and stores tokens by the magnitude of the prediction residual they produce, then reads them through a separate normalization step to produce sharp retrieval. The hybrid is evaluated at 340M parameters after training on 15B tokens, where it reaches lower Wikitext perplexity than both the linear baseline and a matched full-attention model and shows stronger long-context recall. A sympathetic reader would care because the design keeps linear scaling while recovering the exact facts that pure recurrent states lose.

Core claim

HOLA keeps the delta-rule state as a compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory. The cache writes without a learned eviction module, keeping tokens with large beta * ||e||, the prediction residual actually committed to the state; a decoupled RMSNorm-gamma cache read then turns these exact KV pairs into sharp retrieval rather than soft averaging.

What carries the argument

The hippocampal complement: a bounded exact KV cache that writes high-residual tokens via the non-learned rule beta * ||e|| and reads them via decoupled RMSNorm-gamma.
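The write side of this mechanism can be illustrated with a minimal sketch. It assumes a plain (ungated) delta-rule update on a state read as k^T S, a per-token write strength `beta`, and a min-heap that keeps the `cache_bound` highest-scoring pairs. The names `delta_rule_step` and `BoundedExactCache` are hypothetical, and the paper's actual implementation may differ in all of these details.

```python
import heapq
import math


def delta_rule_step(S, k, v, beta):
    """One delta-rule write on S (a d_k x d_v list of lists, read as k^T S).

    Returns the write score beta * ||e||, where e = v - k^T S is the
    prediction residual measured before the update.
    """
    d_k, d_v = len(k), len(v)
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    e = [v[j] - pred[j] for j in range(d_v)]
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * e[j]  # S <- S + beta * k e^T
    return beta * math.sqrt(sum(x * x for x in e))


class BoundedExactCache:
    """Keeps the cache_bound highest-scoring (k, v) pairs; nothing is learned."""

    def __init__(self, cache_bound):
        self.cache_bound = cache_bound
        self._heap = []  # min-heap of (score, position, k, v)

    def offer(self, score, position, k, v):
        item = (score, position, tuple(k), tuple(v))
        if len(self._heap) < self.cache_bound:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict the lowest score

    def positions(self):
        return sorted(pos for _, pos, _, _ in self._heap)


# Toy stream (illustrative values only, not data from the paper).
S = [[0.0, 0.0], [0.0, 0.0]]
cache = BoundedExactCache(cache_bound=2)
stream = [
    ([1.0, 0.0], [1.0, 0.0]),   # new association
    ([1.0, 0.0], [1.0, 0.0]),   # exact repeat, already stored in S
    ([0.0, 1.0], [0.0, 2.0]),   # new association on another key
    ([1.0, 0.0], [-1.0, 0.0]),  # conflicts with the first association
]
scores = []
for t, (k, v) in enumerate(stream):
    score = delta_rule_step(S, k, v, beta=1.0)
    cache.offer(score, t, k, v)
    scores.append(score)
print(scores, cache.positions())
```

In this toy stream the repeated association has zero residual, so it is the first entry displaced once the cache is full, while the conflicting write that overwrites the state is retained.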

If this is right

Wikitext perplexity drops from 27.32 to 22.92, below the 26.88 of a matched full-attention Transformer++.
LAMBADA perplexity improves from 30.95 to 30.26.
In-context retrieval is the best among the linear models compared.
Needle-in-a-haystack recall on RULER stays robust out to 32k tokens, sixteen times the training length.
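The headline figures above can be checked with a few lines of arithmetic. The sketch below assumes that "32k" means 32,768 tokens; the helper name `relative_change` is ours, not the paper's.

```python
def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


wikitext = relative_change(27.32, 22.92)   # Wikitext perplexity
lambada = relative_change(30.95, 30.26)    # LAMBADA perplexity
train_len = 32 * 1024 // 16                # 32k is sixteen times the training length

print(round(wikitext, 1), round(lambada, 1), train_len)
```

The Wikitext change rounds to the 16.1% reduction quoted in the abstract. The LAMBADA gain is much smaller in relative terms, roughly 2%.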

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same residual-driven cache could be added to other linear recurrent architectures such as state-space models to separate compressible patterns from specific facts.
If the cache bound is kept small relative to the number of distinct associations at test time, retrieval accuracy would still degrade once the cache fills.
The non-learned write rule implies that prediction-residual magnitude alone is often a sufficient signal for deciding which associations need exact storage.

Load-bearing premise

The non-learned cache-write rule based on large prediction residuals plus the decoupled RMSNorm-gamma read will reliably capture and surface the associations that the recurrent state forgets without a learned eviction policy or new failure modes at longer contexts.

What would settle it

A controlled experiment in which needle-in-a-haystack recall collapses once the number of competing facts exceeds the fixed cache size, even when residuals remain large, would falsify the claim that the write-and-read mechanism reliably recovers forgotten associations.
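A toy version of that experiment shows the ceiling the cache alone would impose. The sketch assumes every fact produces a large residual score, drawn uniformly at random here purely for illustration, and it ignores whatever the recurrent state still recalls, so it gives a cache-only figure rather than a prediction for HOLA. `cache_only_recall` is a hypothetical helper.

```python
import heapq
import random


def cache_only_recall(num_facts, cache_bound, seed=0):
    """Fraction of facts that survive in a top-`cache_bound` score cache."""
    rng = random.Random(seed)
    # Every fact gets a large write score; ids break ties deterministically.
    scored = [(rng.uniform(1.0, 2.0), fact_id) for fact_id in range(num_facts)]
    kept = {fact_id for _, fact_id in heapq.nlargest(cache_bound, scored)}
    return len(kept) / num_facts


for num_facts in (16, 64, 256):
    print(num_facts, cache_only_recall(num_facts, cache_bound=64))
```

Once the number of competing facts exceeds the bound, the retained fraction falls as cache_bound / num_facts no matter how large the residuals are. That is the regime in which the paper's robustness claim would need to be tested.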

Figures

Figures reproduced from arXiv: 2607.02303 by Wanyun Cui.

**Figure 1.** HOLA lowers perplexity and improves length-robust needle recall. (a) At 340M, HOLA reduces Wikitext perplexity from 27.3 to 22.9 (−16.1%), below full-attention Transformer++ (26.9). (b) On RULER S-NIAH-1, HOLA remains stronger than GDN and HOLA+recency as context grows to 32k tokens.

**Figure 2.** HOLA: semiparametric test-time memory. Every token updates the recurrent state memory (≈ neocortex; lossy, O(1)), while the tokens with large delta-rule write magnitude β·∥e∥ are additionally kept as exact KV pairs in a bounded exact-KV memory (≈ hippocampus). The read-out follows the semiparametric form $o_t = q_t^\top S_t + \lambda_t g_t(q_t)$: a compressive state estimate plus a non-parametric exact-KV read, instantiate…
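The exact-KV read in this read-out is described only as a decoupled RMSNorm-gamma step, so the following is a guess at one plausible form rather than the paper's definition. It RMS-normalizes the query and each cached key, scales their dot product by a scalar `gamma`, and applies a softmax over the cache entries. The function names and the toy vectors are assumptions made for illustration.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Divide x by its root-mean-square magnitude."""
    rms = math.sqrt(sum(a * a for a in x) / len(x) + eps)
    return [a / rms for a in x]


def cache_read(q, keys, values, gamma):
    """Softmax read over cached pairs with gamma-scaled normalized scores."""
    qn = rms_normalize(q)
    logits = [
        gamma * sum(a * b for a, b in zip(qn, rms_normalize(k))) for k in keys
    ]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    d_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return out, weights


# Toy cache with one near-exact match and one distractor key.
keys = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q = [1.0, 0.1]
out_soft, w_soft = cache_read(q, keys, values, gamma=1.0)
out_sharp, w_sharp = cache_read(q, keys, values, gamma=8.0)
print([round(w, 3) for w in w_soft])
print([round(w, 3) for w in w_sharp])
```

With a small `gamma` the read blends the best match with the distractor, and with a larger `gamma` nearly all weight lands on the best-matching pair. This is the contrast the paper draws between soft averaging and sharp retrieval.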

Original abstract

Linear-attention and state-space language models compress the prefix into a fixed-size recurrent state, yielding O(1) memory at the cost of a lossy exact memory: when many key-value associations compete, earlier facts are overwritten and needle recall degrades. Inspired by Complementary Learning Systems, we give linear attention a hippocampal complement. HOLA (Hippocampal Linear Attention) keeps the usual delta-rule state as a compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory: the state models linearly compressible structure, while the cache stores associations that should not be forced through that state. The cache writes without a learned eviction module, keeping tokens with large beta * ||e||, the prediction residual actually committed to the state; a decoupled RMSNorm-gamma cache read then turns these exact KV pairs into sharp retrieval rather than soft averaging. At 340M parameters trained on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92 (-16.1%), below a full-attention Transformer++ (26.88), and improves LAMBADA perplexity from 30.95 to 30.26. It also achieves the best linear in-context retrieval and remains much more robust than GDN or a matched HOLA+recency cache on RULER needle-in-a-haystack recall out to 32k tokens (16x its training length).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HOLA adds a residual-driven exact cache to delta-rule linear attention and posts clear benchmark gains, but the fixed write rule lacks ablations and the gains rest on untested assumptions about what the state forgets.

The paper's main contribution is a concrete hybrid: keep the usual compressive delta-rule state and add a bounded exact KV cache that admits tokens when beta times the residual norm is large, then reads them with a decoupled RMSNorm-gamma. At 340M parameters on 15B SlimPajama tokens this yields a 16.1% Wikitext perplexity drop and better RULER recall out to 32k than the pure linear baseline or a recency cache.

The design is new in the specific combination of residual admission, bounded exact store, and the read norm on top of linear attention. The empirical numbers are direct held-out measurements rather than fitted quantities, which is a plus.

The soft spots are exactly where the stress-test note flags them. There are no ablations on beta, cache bound, or the residual rule itself, no error bars, and no analysis of how the fixed admission interacts with the state at lengths beyond training. The claim that large residuals reliably mark the associations the state overwrites is plausible but not derived or stress-tested in the reported results.

This is for people working on efficient long-context language models who already follow linear attention and state-space work. A reader who wants to try the cache rule on their own setup would find the description usable.

It deserves peer review. The architecture is simple enough to reproduce and the headline numbers are large enough to be worth checking with proper controls.

Referee Report

3 major / 2 minor

Summary. The paper introduces HOLA, a semiparametric extension to linear attention that augments the standard delta-rule recurrent state with a bounded exact KV cache. Tokens are written to the cache when beta * ||e|| is large (where e is the prediction residual committed by the state), and retrieved via a decoupled RMSNorm-gamma mechanism for sharp, non-averaged access. At 340M parameters trained on 15B SlimPajama tokens, the model reports Wikitext perplexity reduction from 27.32 to 22.92, LAMBADA improvement from 30.95 to 30.26, superior linear in-context retrieval, and improved RULER needle recall out to 32k tokens relative to GDN and recency-cache baselines.

Significance. If the empirical claims hold under fuller validation, the result would be significant: it offers a concrete, non-learned mechanism to mitigate overwriting in fixed-size linear states without requiring a full attention matrix or learned eviction policy. The concrete benchmark deltas and the 16x extrapolation on RULER are notable strengths; the approach directly addresses a known limitation of linear attention while preserving O(1) inference cost for the recurrent component.

major comments (3)

[§3.2–3.3] The cache-write rule (tokens with large beta * ||e||) and decoupled RMSNorm-gamma read are presented as fixed, non-learned components, yet no derivation, correlation analysis, or ablation demonstrates why residual magnitude preferentially identifies associations the delta-rule state overwrites. This is load-bearing for the central claim that the cache reliably supplies exactly what the recurrent state forgets.
[Experiments / §4] Experiments (reported numbers in abstract and §4): The headline results (Wikitext 27.32→22.92, LAMBADA 30.95→30.26, RULER gains) are given without error bars, standard deviations across seeds, or ablation tables on cache_bound, beta threshold, or cache size. This absence makes it impossible to assess whether the reported margins are stable or sensitive to the two free hyperparameters.
[§4 / RULER] RULER evaluation (abstract and §4): While improved recall to 32k is claimed, there is no reported analysis of cache occupancy, eviction frequency, or interference patterns at lengths 16× training context; without this, it remains unclear whether the bounded cache introduces new failure modes that offset the claimed robustness.

minor comments (2)

[§3.2] Notation: the symbol beta is introduced without an explicit equation defining its scaling relative to the delta-rule update; a one-line definition would improve reproducibility.
[§4] The manuscript would benefit from a short table listing all free parameters (including cache_bound) and their chosen values for the 340M run.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed, constructive comments. We address each major point below and indicate the revisions planned for the manuscript.

Referee: [§3.2–3.3] The cache-write rule (tokens with large beta * ||e||) and decoupled RMSNorm-gamma read are presented as fixed, non-learned components, yet no derivation, correlation analysis, or ablation demonstrates why residual magnitude preferentially identifies associations the delta-rule state overwrites. This is load-bearing for the central claim that the cache reliably supplies exactly what the recurrent state forgets.

Authors: The mechanism is motivated by the Complementary Learning Systems framework, positing that the delta-rule state captures compressible structure while the residual identifies non-compressible associations prone to overwrite. The manuscript supports this via consistent gains on perplexity, LAMBADA, and RULER recall. We agree a dedicated correlation analysis and ablation against random/recency baselines would strengthen the claim and will add both in §3 and an appendix. revision: yes
Referee: [Experiments / §4] The headline results (Wikitext 27.32→22.92, LAMBADA 30.95→30.26, RULER gains) are given without error bars, standard deviations across seeds, or ablation tables on cache_bound, beta threshold, or cache size. This absence makes it impossible to assess whether the reported margins are stable or sensitive to the two free hyperparameters.

Authors: The main results are single-run due to compute scale. We will add a hyperparameter sensitivity table for cache size and beta threshold in the revision and explicitly note the single-seed limitation; full multi-seed error bars would require additional training runs beyond current resources. revision: partial
Referee: [§4 / RULER] While improved recall to 32k is claimed, there is no reported analysis of cache occupancy, eviction frequency, or interference patterns at lengths 16× training context; without this, it remains unclear whether the bounded cache introduces new failure modes that offset the claimed robustness.

Authors: We will add cache occupancy, eviction frequency, and interference analysis at 16k–32k lengths (including visualizations) to the RULER section and appendix to demonstrate that the bounded cache does not introduce offsetting failure modes. revision: yes
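The occupancy and eviction-frequency analysis requested here could start from a trace like the one below. This is a sketch under the assumption that the cache is a score-ordered min-heap fed one write score per token; `eviction_trace` and the toy scores are hypothetical and stand in for whatever instrumentation the authors add.

```python
import heapq


def eviction_trace(scores, cache_bound):
    """Replay write scores through a bounded top-score cache and log stats."""
    heap = []          # min-heap of (score, position)
    occupancy = []     # cache size after each token
    evictions = 0      # cached entries displaced by a higher score
    rejected = 0       # tokens that never entered the full cache
    for position, score in enumerate(scores):
        if len(heap) < cache_bound:
            heapq.heappush(heap, (score, position))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, position))
            evictions += 1
        else:
            rejected += 1
        occupancy.append(len(heap))
    return {
        "occupancy": occupancy,
        "evictions": evictions,
        "rejected": rejected,
        "survivors": sorted(pos for _, pos in heap),
    }


# Toy write scores (illustrative only, not measurements from the paper).
trace = eviction_trace([0.9, 0.1, 0.5, 0.7, 0.2, 0.8], cache_bound=3)
print(trace)
```

Tracking the eviction rate against context length would show whether the cache keeps churning at 16k to 32k tokens or settles on a stable set of survivors.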

Circularity Check

0 steps flagged

No circularity: empirical results on held-out data with no self-referential derivation

The paper reports measured perplexity and recall improvements on Wikitext, LAMBADA, and RULER benchmarks after training on SlimPajama. The cache-write rule (beta * ||e||) and RMSNorm-gamma read are fixed design choices whose performance is evaluated externally rather than derived from parameters fitted to the target metrics. No equations reduce a claimed prediction to the inputs by construction, and no self-citation chain is invoked to justify the central mechanism. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The design rests on the standard delta-rule update for the recurrent state and on the assumption that a fixed cache size plus residual-based admission is sufficient; no new physical constants or learned eviction parameters are introduced.

free parameters (2)

beta: Scaling factor in the cache-write condition beta * ||e||; its value is not stated as derived from first principles.
cache_bound: Maximum number of exact KV pairs stored; chosen as a hyperparameter rather than derived.

axioms (1)

Domain assumption: The recurrent state is updated by the standard delta rule of linear attention. Invoked as the baseline compressive memory that the hippocampal cache complements.

A. Scale configurations

Table 6 lists the architecture, corpus size, and context length for the scaling comparison in Table … The 46M row uses a smaller 12-layer, d_model=512 architecture trained on FineWeb-Edu for 0.5B tokens at ctx 4096; this is the scale used for the component studies in Tables 4–5.

| Scale | d_model | layers | corpus | train tokens | ctx |
|---|---|---|---|---|---|
| 46M | 512 | 12 | FineWeb-Edu | 0.5B | 4096 |
| 170M | 1024 | 12 | SlimPajama | 6.22B | 2048 |
| 340M | 1024 | 24 | SlimPajama | 15.0B | 2048 |

For 170M and 340M, the architecture family follows the GDN recipe: 4 heads × head-dim 256, expand_v=1, hidden_ratio=4, conv 4, tied embeddings, vocabulary 32000, Mistral tokenizer, and AdamW with peak lr 4×10^−4. The 4...
