hub Mixed citations

Hgrn2: Gated linear rnns with state expansion.ArXiv preprint, abs/2404.07904

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong · 2024 · arXiv 2404.07904

Mixed citation behavior. Most common role is background (60%).

15 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 3 unclear 2

representative citing papers

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

q-bio.NC · 2026-05-13 · unverdicted · novelty 7.0

SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

Universal Time Series Generation with Neural Controlled Differential Equations

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Gated DeltaNet-2 decouples channel-wise erase and write gates in linear attention, generalizing prior DeltaNet and KDA models while showing stronger results on language modeling and long-context retrieval at 1.3B scale.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

cs.LG · 2026-04-21 · unverdicted · novelty 5.0

FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.

Attention Residuals

cs.CL · 2026-03-16 · unverdicted · novelty 5.0

Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.

Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

cs.LG · 2025-10-30 · unverdicted · novelty 5.0

Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.

LT2: Linear-Time Looped Transformers

cs.LG · 2026-05-20

Selective Rotary Position Embedding

cs.CL · 2025-11-21

citing papers explorer

Showing 15 of 15 citing papers.

Test-Time Training with KV Binding Is Secretly Linear Attention cs.LG · 2026-02-24 · conditional · none · ref 14
Test-time training with KV binding reduces to learned linear attention.
SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting q-bio.NC · 2026-05-13 · unverdicted · none · ref 5
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality cs.LG · 2024-05-31 · unverdicted · none · ref 84
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
Universal Time Series Generation with Neural Controlled Differential Equations cs.LG · 2026-05-27 · unverdicted · none · ref 54
Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention cs.AI · 2026-05-21 · unverdicted · none · ref 42
Gated DeltaNet-2 decouples channel-wise erase and write gates in linear attention, generalizing prior DeltaNet and KDA models while showing stronger results on language modeling and long-context retrieval at 1.3B scale.
Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 80
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 28
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 78
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 75
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 61 · 2 links
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control cs.LG · 2026-04-21 · unverdicted · none · ref 30
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
Attention Residuals cs.CL · 2026-03-16 · unverdicted · none · ref 41
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism cs.LG · 2025-10-30 · unverdicted · none · ref 29
Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.
LT2: Linear-Time Looped Transformers cs.LG · 2026-05-20 · unreviewed · ref 50
Selective Rotary Position Embedding cs.CL · 2025-11-21 · unreviewed · ref 51

Hgrn2: Gated linear rnns with state expansion.ArXiv preprint, abs/2404.07904

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer