pith. machine review for the scientific record. sign in

citation dossier

Griffin: Mixing gated linear recurrences with local attention for efficient language models

S · 2024 · arXiv 2402.19427

17Pith papers citing it
17reference links
cs.LGtop field · 10 papers
UNVERDICTEDtop verdict bucket · 17 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 17 reviewed papers. Its strongest current cluster is cs.LG (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (17 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

years

2026 16 2024 1

verdicts

UNVERDICTED 17

representative citing papers

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

Optimal Decay Spectra for Linear Recurrences

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval across Mamba-2, RWKV-7 and similar models.

Kaczmarz Linear Attention

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.

TAPNext++: What's Next for Tracking Any Point (TAP)?

cs.CV · 2026-04-12 · unverdicted · novelty 5.0

TAPNext++ trains recurrent transformers on 1024-frame sequences with geometric augmentations and occluded-point supervision to achieve new state-of-the-art point tracking on long videos while adding a re-detection metric.

citing papers explorer

Showing 17 of 17 citing papers.