pith. machine review for the scientific record.


Zoology: Measuring and improving recall in efficient language models

10 Pith papers cite this work. Polarity classification is still indexing.



fields

cs.LG (7) · cs.CL (3)

representative citing papers

Long Context Pre-Training with Lighthouse Attention

cs.CL · 2026-05-07 · conditional · novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.
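The summary does not spell out the compression scheme, so the following is only a hedged illustration of the general idea of causality-preserving, gradient-free KV compression: keep recent keys and values at full resolution and mean-pool older blocks before attending. The function name, the single pooling level, and all shapes are our assumptions, not Lighthouse's actual algorithm (which is hierarchical and also compresses Q).

    import numpy as np

    def pooled_causal_attention(q, K, V, local=64, block=16):
        # One query attending over all t = len(K) cached positions.
        # Only past positions are used, so causality is preserved; keys
        # and values older than `local` are mean-pooled in fixed blocks,
        # a gradient-free compression. Illustrative stand-in only.
        t = len(K)
        n_blocks = max(0, t - local) // block      # whole blocks to pool
        cut = n_blocks * block
        if n_blocks:
            Kp = K[:cut].reshape(n_blocks, block, -1).mean(axis=1)
            Vp = V[:cut].reshape(n_blocks, block, -1).mean(axis=1)
            K_eff = np.concatenate([Kp, K[cut:]])
            V_eff = np.concatenate([Vp, V[cut:]])
        else:
            K_eff, V_eff = K, V
        scores = K_eff @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        return (w / w.sum()) @ V_eff

A hierarchical version would repeat the pooling at coarser scales for progressively older blocks; one level is shown to keep the sketch short.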

Adaptive Memory Decay for Log-Linear Attention

cs.LG · 2026-05-07 · conditional · novelty 5.0

Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
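Taken literally, the mechanism is compact: in a linear-attention recurrence, the decay applied to the running state is predicted per token by a small MLP rather than being a fixed learned constant. A minimal single-state sketch under that reading (names and shapes are ours; log-linear attention proper maintains a hierarchy of states at multiple time scales):

    import numpy as np

    def mlp_decay(x, W1, W2):
        # Tiny MLP mapping a token embedding to per-channel decays in (0, 1).
        h = np.maximum(W1 @ x, 0.0)                # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(W2 @ h)))     # sigmoid -> decay rates

    def linear_attention_adaptive_decay(X, Wq, Wk, Wv, W1, W2):
        d_k, d_v = Wk.shape[0], Wv.shape[0]
        S = np.zeros((d_v, d_k))                   # associative-memory state
        outputs = []
        for x in X:                                # one embedding per step
            q, k, v = Wq @ x, Wk @ x, Wv @ x
            lam = mlp_decay(x, W1, W2)             # input-dependent, shape (d_k,)
            S = S * lam[None, :] + np.outer(v, k)  # decay memory, then write
            outputs.append(S @ q)                  # read out for this query
        return np.stack(outputs)

A constant-decay baseline is recovered by replacing mlp_decay with a fixed vector, which is exactly the knob the paper makes input-dependent.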

Sessa: Selective State Space Attention

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
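"Power-law memory tails" has a concrete meaning worth unpacking: a single exponential decay forgets at one fixed rate, while a weighted mixture of exponentials with log-spaced rates retains information roughly like delay**(-a). The check below is our illustration of that standard fact, not Sessa's construction:

    import numpy as np

    # Weighting log-spaced decay rates lam by lam**a makes the mixture's
    # retention fall roughly like delay**(-a), a power-law tail, while
    # any single exponential collapses.
    a = 0.5
    rates = np.logspace(-5, 0, 80)
    weights = rates**a / (rates**a).sum()

    for d in (10, 100, 1000):
        mixture = weights @ np.exp(-rates * d)
        single = np.exp(-0.1 * d)                  # one mid-range rate
        print(f"delay {d:>4}: mixture {mixture:.4f}   single {single:.2e}")
    # mixture shrinks roughly 3x per decade of delay; single vanishes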

Gated Delta Networks: Improving Mamba2 with Delta Rule

cs.CL · 2024-12-09 · unverdicted · novelty 5.0

Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
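Both ingredients have a crisp algebraic form. The delta rule corrects the memory's current prediction for key k_t toward the new value v_t; the gate alpha additionally scales the whole state so stale associations can be erased. A minimal per-step sketch of the combined update (one head, simplified parameterization):

    import numpy as np

    def gated_delta_step(S, q, k, v, alpha, beta):
        # S: (d_v, d_k) associative memory mapping keys to values.
        # alpha in (0, 1): data-dependent gate decaying the whole state.
        # beta in (0, 1): write strength for the delta-rule correction.
        k = k / np.linalg.norm(k)                # unit-norm key, as is typical
        S = alpha * S                            # gating: forget stale content
        err = v - S @ k                          # delta rule: prediction error
        S = S + beta * np.outer(err, k)          # write only the correction
        return S, S @ q                          # new state, output for q

Setting alpha = 1 recovers plain DeltaNet, while replacing the error-correcting write with a plain outer-product write gives a Mamba2-style decay-plus-write update, which is the design space the paper interpolates.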

citing papers explorer

Showing 10 of 10 citing papers.

  • Variational Linear Attention: Stable Associative Memory for Long-Context Transformers cs.LG · 2026-05-11 · conditional · none · ref 1

    VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy over standard linear attention and DeltaNet. A reconstruction of this update is sketched after the list.

  • Long Context Pre-Training with Lighthouse Attention cs.CL · 2026-05-07 · conditional · none · ref 2

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.

  • OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention cs.LG · 2026-05-13 · unverdicted · none · ref 2

    OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales. The preconditioned update is sketched, with caveats, after the list.

  • HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 26

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  • Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 5

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  • Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 1

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba. The recurrence is sketched after the list.

  • Adaptive Memory Decay for Log-Linear Attention cs.LG · 2026-05-07 · conditional · none · ref 21

    Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.

  • MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 25

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models. A sequential form of the momentum update is sketched after the list.

  • Sessa: Selective State Space Attention cs.LG · 2026-04-20 · unverdicted · none · ref 51

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  • Gated Delta Networks: Improving Mamba2 with Delta Rule cs.CL · 2024-12-09 · unverdicted · none · ref 294

    Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
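The spectral-norm claim in the VLA entry above follows directly from the stated update, so it is worth reconstructing. If each write solves min_S ||S k - v||^2 + lam * ||S - S_prev||_F^2, the closed form is S = (lam * S_prev + v k^T)(lam * I + k k^T)^{-1}, whose state-to-state Jacobian has eigenvalue 1 orthogonal to k and lam/(lam + ||k||^2) along k. This is our reconstruction from the one-line summary, not the paper's code:

    import numpy as np

    def vla_style_step(S, k, v, lam=1.0):
        # Closed-form solution of the regularized least-squares write
        #   min_S ||S k - v||^2 + lam * ||S - S_prev||_F^2.
        # With unit-length k the map S_prev -> S contracts by lam/(lam+1)
        # along k and is an isometry orthogonal to k: spectral norm 1.
        k = k / np.linalg.norm(k)                # "unit-length writes"
        d = k.shape[0]
        A_inv = np.linalg.inv(lam * np.eye(d) + np.outer(k, k))
        return (lam * S + np.outer(v, k)) @ A_inv

    # The update is affine in S, so a perturbation dS propagates as
    # lam * dS @ A_inv; its spectral norm should be exactly 1.
    rng = np.random.default_rng(0)
    d, lam = 8, 1.0
    k = rng.standard_normal(d); k /= np.linalg.norm(k)
    A_inv = np.linalg.inv(lam * np.eye(d) + np.outer(k, k))
    print(np.linalg.norm(lam * A_inv, 2))        # -> 1.0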
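For the OSDN entry, recall that the delta rule is one gradient step on ||S k - v||^2, so "online diagonal preconditioning" means rescaling that step per key coordinate from running statistics. The accumulator below is an Adagrad-style placeholder chosen for familiarity; OSDN's actual preconditioner and its super-geometric convergence proof are in the paper, not reproduced here:

    import numpy as np

    def preconditioned_delta_step(S, k, v, G, eps=1e-8):
        # Plain delta rule: S <- S - (S k - v) k^T, a gradient step on
        # 0.5 * ||S k - v||^2. Here the step along each key coordinate
        # is rescaled by an online second-moment accumulator G.
        G = G + k * k                            # online key statistics
        step = k / (np.sqrt(G) + eps)            # diagonally preconditioned
        err = S @ k - v                          # current prediction error
        return S - np.outer(err, step), G

Because the preconditioner is diagonal, the update stays a rank-1 write, consistent with the summary's claim that chunkwise parallelism is preserved.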
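The gated linear attention entry has a standard recurrent form at inference time: a matrix state with a data-dependent, per-dimension forget gate. A one-head sketch of that recurrence with a simplified gate parameterization (the paper's contribution is the hardware-efficient chunkwise-parallel training form, which this sequential loop deliberately does not show):

    import numpy as np

    def gla_layer(X, Wq, Wk, Wv, Wg):
        # Recurrence: S_t = diag(g_t) S_{t-1} + k_t v_t^T,  o_t = S_t^T q_t,
        # with gate g_t in (0, 1)^{d_k} computed from the current input.
        d_k, d_v = Wk.shape[0], Wv.shape[0]
        S = np.zeros((d_k, d_v))
        out = []
        for x in X:
            q, k, v = Wq @ x, Wk @ x, Wv @ x
            g = 1.0 / (1.0 + np.exp(-(Wg @ x)))   # sigmoid forget gate
            S = g[:, None] * S + np.outer(k, v)   # decay rows, then write
            out.append(S.T @ q)                   # read out for this query
        return np.stack(out)

Linear-time inference is visible directly: each step touches only the fixed-size state S, never the full history.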
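Finally, the MDN entry's "stepwise momentum" reads naturally as heavy-ball momentum applied to delta-rule writes. A sequential sketch of that reading follows; the paper's actual contribution, parallelizing this recurrence via geometric reordering, is exactly what a step-by-step loop cannot show:

    import numpy as np

    def momentum_delta_step(S, M, k, v, beta=0.5, mu=0.9):
        # Treat the delta-rule write as a gradient step on
        # 0.5 * ||S k - v||^2 and smooth it with momentum buffer M.
        k = k / np.linalg.norm(k)
        g = np.outer(S @ k - v, k)        # gradient w.r.t. the state S
        M = mu * M + g                    # heavy-ball accumulation
        return S - beta * M, M            # smoothed state update

Setting mu = 0 recovers the plain delta rule; with mu > 0, one token's write keeps correcting the state over the following steps.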