hub Canonical reference

Rwkv-7" goose" with expressive dynamic state evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Hou, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, et al · 2025 · arXiv 2503.14456

Canonical reference. 100% of citing Pith papers cite this work as background.

28 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 28 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

PWO is a trust-region optimizer for autoregressive NQS that improves stability over Adam and stochastic reconfiguration methods while scaling to 1.5B-parameter models on spin systems.

Tapered Language Models

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.

An Algebraic View of the Expressivity of Recurrent Language Models

cs.FL · 2026-06-01 · unverdicted · novelty 7.0

A unified algebraic account reduces RNN expressivity to syntactic monoid division in wreath products and shows diagonal state-space models realize every even-modulus counter under unsigned-integer quantization but none under floating-point recurrences.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

MG-RWKV: Multi-Grained Context-Aware RWKV for Temporal Forgery Localization

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

MG-RWKV combines bidirectional RWKV, multi-granularity mixture of experts, and cross-granularity consistency to achieve state-of-the-art temporal forgery localization with linear complexity.

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.

Dynamic Short Convolutions Improve Transformers

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Dynamic short convolutions applied to key/query/value projections and linear layers in Transformers yield consistent performance gains and 1.33-1.60x compute advantages over standard models on language modeling from 150M to 2B parameters.

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

DSRD unifies temporal and structural adaptation for dynamic graphs via a single recurrent retentive state with learnable time-sensitivity parameters in the decay kernels.

Universal Time Series Generation with Neural Controlled Differential Equations

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.

Towards Understanding Self-Pretraining for Sequence Classification

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

Optimal Decay Spectra for Linear Recurrences

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval across Mamba-2, RWKV-7 and similar models.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Higher-order Linear Attention

cs.LG · 2025-10-31 · unverdicted · novelty 6.0

Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

cs.LG · 2026-04-21 · unverdicted · novelty 5.0

FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.

Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

Fine-tuned recurrent models like Mamba2 produce competitive text embeddings with linear-time constant-memory inference via vertical chunking, outperforming transformers in memory use.

Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

cs.LG · 2025-10-30 · unverdicted · novelty 5.0

Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.

citing papers explorer

Showing 28 of 28 citing papers.

Test-Time Training with KV Binding Is Secretly Linear Attention cs.LG · 2026-02-24 · conditional · none · ref 13
Test-time training with KV binding reduces to learned linear attention.
One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective cs.LG · 2026-07-02 · unverdicted · none · ref 154
PWO is a trust-region optimizer for autoregressive NQS that improves stability over Adam and stochastic reconfiguration methods while scaling to 1.5B-parameter models on spin systems.
Tapered Language Models cs.LG · 2026-06-22 · unverdicted · none · ref 26
Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
An Algebraic View of the Expressivity of Recurrent Language Models cs.FL · 2026-06-01 · unverdicted · none · ref 39
A unified algebraic account reduces RNN expressivity to syntactic monoid division in wreath products and shows diagonal state-space models realize every even-modulus counter under unsigned-integer quantization but none under floating-point recurrences.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 39
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
MG-RWKV: Multi-Grained Context-Aware RWKV for Temporal Forgery Localization cs.CV · 2026-07-01 · unverdicted · none · ref 36
MG-RWKV combines bidirectional RWKV, multi-granularity mixture of experts, and cross-granularity consistency to achieve state-of-the-art temporal forgery localization with linear complexity.
Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning cs.AI · 2026-06-10 · unverdicted · none · ref 15
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
Dynamic Short Convolutions Improve Transformers cs.LG · 2026-06-02 · unverdicted · none · ref 65
Dynamic short convolutions applied to key/query/value projections and linear layers in Transformers yield consistent performance gains and 1.33-1.60x compute advantages over standard models on language modeling from 150M to 2B parameters.
Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs cs.LG · 2026-05-28 · unverdicted · none · ref 10
DSRD unifies temporal and structural adaptation for dynamic graphs via a single recurrent retentive state with learnable time-sensitivity parameters in the decay kernels.
Universal Time Series Generation with Neural Controlled Differential Equations cs.LG · 2026-05-27 · unverdicted · none · ref 49
Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.
Towards Understanding Self-Pretraining for Sequence Classification cs.LG · 2026-05-20 · unverdicted · none · ref 94
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention cs.LG · 2026-05-13 · unverdicted · none · ref 39
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 83
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
A Single-Layer Model Can Do Language Modeling cs.CL · 2026-05-11 · unverdicted · none · ref 8
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 26
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
Optimal Decay Spectra for Linear Recurrences cs.LG · 2026-04-08 · unverdicted · none · ref 9
PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval across Mamba-2, RWKV-7 and similar models.
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression cs.LG · 2025-11-26 · unverdicted · none · ref 45
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
Higher-order Linear Attention cs.LG · 2025-10-31 · unverdicted · none · ref 8
Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 72
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 58 · 2 links
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 45
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control cs.LG · 2026-04-21 · unverdicted · none · ref 14
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models cs.CL · 2026-04-20 · unverdicted · none · ref 19
Fine-tuned recurrent models like Mamba2 produce competitive text embeddings with linear-time constant-memory inference via vertical chunking, outperforming transformers in memory use.
Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism cs.LG · 2025-10-30 · unverdicted · none · ref 28
Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.
When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet cs.LG · 2026-06-04 · unverdicted · none · ref 51
A multiplication-only truncated Neumann approximation for matrix inversion in quantized Gated DeltaNet linear attention delivers up to 5x kernel speedup and 20% decode overhead reduction while preserving accuracy on Qwen3.5 models.
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights cs.CL · 2025-10-06 · unverdicted · none · ref 39
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.
Learning State-Tracking from Code Using Linear RNNs cs.LG · 2026-02-16 · unreviewed · ref 11
Selective Rotary Position Embedding cs.CL · 2025-11-21 · unreviewed · ref 47

Rwkv-7" goose" with expressive dynamic state evolution

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer