Test-time training with KV binding reduces to learned linear attention.
HGRN2: Gated Linear RNNs with State Expansion. arXiv preprint, abs/2404.07904
Mixed citation behavior. Most common role is background (60%).
roles
background: 5
representative citing papers
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
Transformers and SSMs are unified through structured state space duality, producing a Mamba-2 model that is 2-8x faster while remaining competitive with Transformers.
Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.
Gated DeltaNet-2 decouples channel-wise erase and write gates in linear attention, generalizing prior DeltaNet and KDA models while showing stronger results on language modeling and long-context retrieval at 1.3B scale.
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.
citing papers explorer
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention