Zoology: Measuring and improving recall in efficient language models

Simran Arora, Sabri Eyuboglu, et al. · 2023 · arXiv 2312.04927

Canonical reference. 83% of citing Pith papers cite this work as background.

18 Pith papers citing it

Background 83% of classified citations

citation-role summary

background 6

citation-polarity summary

background 5 · support 1

representative citing papers

Parallax: Parameterized Local Linear Attention for Language Modeling

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Parallax is a scalable parameterized local linear attention variant that improves LLM pretraining perplexity at 0.6B/1.7B scales with a hardware-aware kernel and shows gains under parameter- and compute-matched controls.

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

cs.LG · 2026-05-11 · conditional · novelty 7.0

VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy over standard linear attention and DeltaNet.

Long Context Pre-Training with Lighthouse Attention

cs.CL · 2026-05-07 · conditional · novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.

A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

HOLA pairs a compressive delta-rule recurrent state with a residual-selected exact KV cache and decoupled RMSNorm-gamma read, yielding lower perplexity than both standard linear attention and full-attention baselines on Wikitext and LAMBADA plus stronger needle-in-haystack recall.

Dynamic Short Convolutions Improve Transformers

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Dynamic short convolutions applied to key/query/value projections and linear layers in Transformers yield consistent performance gains and 1.33-1.60x compute advantages over standard models on language modeling from 150M to 2B parameters.

Blurry Window Attention

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Blurry Window Attention stores a frequency window and reconstructs blurry KV history via Dirichlet kernel interpolation, achieving 8x better state efficiency than sliding window attention on the MQAR synthetic task.

Memory by Design: Probabilistic Sequence Layers

stat.ML · 2026-05-29 · unverdicted · novelty 6.0

The design-model framework unifies sub-quadratic sequence models as Bayesian filters and introduces a covariance-tracking Bayesian Layer that improves retrieval robustness beyond training regimes on MQAR and RULER benchmarks.

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module, outperforming full attention on the evaluated tasks while cutting KV cache use by 75% and speeding up decoding by up to 6x.

An Empirical Study of Mamba-based Language Models

cs.LG · 2024-06-12 · accept · novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

Q-Delta: Beyond Key-Value Associative State Evolution

cs.AI · 2026-06-07 · unverdicted · novelty 5.0

Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.

Adaptive Memory Decay for Log-Linear Attention

cs.LG · 2026-05-07 · conditional · novelty 5.0

Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

Sessa: Selective State Space Attention

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

Gated Delta Networks: Improving Mamba2 with Delta Rule

cs.CL · 2024-12-09 · unverdicted · novelty 5.0

Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.

citing papers explorer

Parallax: Parameterized Local Linear Attention for Language Modeling · cs.LG · 2026-05-27 · unverdicted · none · ref 1
Variational Linear Attention: Stable Associative Memory for Long-Context Transformers · cs.LG · 2026-05-11 · conditional · none · ref 1
Dynamic Short Convolutions Improve Transformers · cs.LG · 2026-06-02 · unverdicted · none · ref 155
Blurry Window Attention · cs.LG · 2026-05-31 · unverdicted · none · ref 23
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention · cs.LG · 2026-05-13 · unverdicted · none · ref 2
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models · cs.LG · 2026-04-24 · unverdicted · none · ref 26
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression · cs.LG · 2025-11-26 · unverdicted · none · ref 1
An Empirical Study of Mamba-based Language Models · cs.LG · 2024-06-12 · accept · none · ref 3
Gated Linear Attention Transformers with Hardware-Efficient Training · cs.LG · 2023-12-11 · unverdicted · none · ref 1
Adaptive Memory Decay for Log-Linear Attention · cs.LG · 2026-05-07 · conditional · none · ref 21
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention · cs.LG · 2026-05-07 · unverdicted · none · ref 25
Sessa: Selective State Space Attention · cs.LG · 2026-04-20 · unverdicted · none · ref 51
