Gated Delta Networks: Improving Mamba2 with Delta Rule
abstract
Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
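For intuition, here is a minimal sequential NumPy sketch of the gated delta rule, S_t = α_t S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, which composes a Mamba2-style forget gate α_t with the delta rule's targeted write strength β_t. Variable names are illustrative; the paper's hardware-efficient, chunkwise-parallel training algorithm is not shown by this recurrent form.

```python
import numpy as np

def gated_delta_rule(K, V, alpha, beta):
    """Sequential form of the gated delta rule (illustrative sketch).

    K: (T, d_k) keys, V: (T, d_v) values,
    alpha: (T,) forget gates in (0, 1], beta: (T,) write strengths in (0, 1].
    The state S is a (d_v, d_k) fast-weight matrix: each step decays it by
    alpha_t (rapid erasure), then delta-corrects the value stored under k_t
    (targeted update):  S_t = alpha_t * S_{t-1}(I - beta_t k k^T) + beta_t v k^T
    """
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_v, d_k))
    outs = []
    for k, v, a, b in zip(K, V, alpha, beta):
        S = a * S                              # gating: uniform memory decay
        S = S + np.outer(b * (v - S @ k), k)   # delta rule: replace value at k
        outs.append(S @ k)                     # toy read-out (normally a query)
    return np.stack(outs)
```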
2026: 25 representative citing papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
-
Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
-
SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
-
Mixture of Layers with Hybrid Attention
Mixture of Layers replaces monolithic transformer blocks with thin, routed parallel blocks under a hybrid attention scheme: a shared softmax block provides global context, while the routed blocks use Gated DeltaNet linear attention.
-
Transformers with Selective Access to Early Representations
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R improves long-sequence 3D reconstruction by decoupling tracking from mapping with a hybrid memory that combines a TTT-updated MLP with explicit tokens, reducing both model size and trajectory error.
-
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
S0 tuning optimizes the initial recurrent states of hybrid models, outperforming LoRA on HumanEval with zero added inference cost and showing partial cross-domain transfer.
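Only this one-line summary is available, but the mechanism is concrete enough to sketch: freeze every weight and optimize the initial recurrent state alone. A hypothetical PyTorch outline, assuming a model that accepts an initial_state argument (that interface is our assumption, not the paper's API):

```python
import torch

def s0_tune(model, dataloader, state_shape, steps=1000, lr=1e-3):
    """Hypothetical S0 tuning: train only the initial recurrent state."""
    for p in model.parameters():
        p.requires_grad_(False)                # all model weights stay frozen
    S0 = torch.zeros(state_shape, requires_grad=True)
    opt = torch.optim.Adam([S0], lr=lr)
    for _, (input_ids, labels) in zip(range(steps), dataloader):
        logits = model(input_ids, initial_state=S0)  # assumed interface
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), labels.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
    return S0.detach()
```

Since the tuned S0 simply replaces the default zero initial state, nothing extra runs at inference time, which is consistent with the zero-cost claim above.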
-
A Single-Layer Model Can Do Language Modeling
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
-
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training them from scratch, different compression methods converge after continued pretraining, and combining knowledge distillation with a language-modeling loss and progressive schedules yields a competitive 23A2B model from Qwen3-Next-80B-A3B.
-
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
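The summary does not give Cubit's actual formulation, but generic kernel ridge regression token mixing can be sketched as fitting a ridge regression from keys to values and evaluating it at the queries. A non-causal toy, assuming an RBF kernel:

```python
import numpy as np

def krr_mix(Q, K, V, lam=1e-2, gamma=1.0):
    """Token mixing via kernel ridge regression (illustrative sketch).

    Fits f minimizing ||f(K) - V||^2 + lam ||f||^2 in an RBF RKHS, then
    evaluates at the queries: out = k(Q, K) @ (k(K, K) + lam I)^-1 @ V.
    This bidirectional form ignores causal masking.
    """
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    n = K.shape[0]
    alpha = np.linalg.solve(rbf(K, K) + lam * np.eye(n), V)  # (n, d_v)
    return rbf(Q, K) @ alpha
```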
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
The Impossibility Triangle of Long-Context Modeling
No model can simultaneously achieve efficiency, compactness, and recall capacity that scales with sequence length: any two of the three imply a strict O(poly(d)/log V) bound on the number of recallable facts.
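The bound has a standard information-theoretic flavor; a minimal counting sketch (our gloss, not the paper's proof) is:

```latex
% A compact state of O(poly(d)) bits encodes at most 2^{O(poly(d))}
% distinct memory configurations, and distinguishing N key-value facts
% over a vocabulary of size V requires roughly N log V bits, hence:
\[
  N \log V \le O(\mathrm{poly}(d))
  \;\Longrightarrow\;
  N \le O\!\left(\frac{\mathrm{poly}(d)}{\log V}\right).
\]
```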
-
Learning to Forget: Continual Learning with Adaptive Weight Decay
FADE adapts per-parameter weight decay rates online via approximate meta-gradient descent to improve controlled forgetting over fixed decay in online tracking and streaming classification.
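The summary names only the mechanism; a hypothetical sketch of per-parameter decay adapted by a one-step approximate meta-gradient (illustrative, not the paper's algorithm) might look like:

```python
import torch

def fade_update(param, grad, lam, prev_param, lr=1e-2, meta_lr=1e-4):
    """One illustrative FADE-style step (hypothetical; not the paper's code).

    Base update is SGD with per-parameter decay:
        theta_t = theta_{t-1} - lr * (grad_{t-1} + lam * theta_{t-1})
    Differentiating today's loss through that update gives the one-step
    meta-gradient  dL/dlam ~= grad_t * (-lr * prev_param),  so decay grows
    exactly where shrinking the weight would have reduced today's loss.
    """
    hypergrad = grad * (-lr * prev_param)
    lam = torch.clamp(lam - meta_lr * hypergrad, min=0.0)  # decay stays >= 0
    new_param = param - lr * (grad + lam * param)
    return new_param, lam
```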
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attention baselines on LongBench and RAG tasks at 8x-32x compression.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
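A rough sketch of the chunk-wise scheme this describes (assumed interface, not the paper's code): iterate over context chunks and, after each one, take a gradient step on the MLP projection matrices alone using the next-token loss on that chunk.

```python
import torch

def in_place_ttt(model, input_ids, chunk=512, lr=1e-4):
    """Illustrative chunk-wise test-time training of MLP projections only.

    Assumes `model.mlp_projection_parameters()` (a hypothetical helper)
    exposes the MLP projection weights and `model` returns next-token
    logits; all other weights stay frozen.
    """
    opt = torch.optim.SGD(list(model.mlp_projection_parameters()), lr=lr)
    for start in range(0, input_ids.size(1) - 1, chunk):
        ids = input_ids[:, start:start + chunk + 1]
        logits = model(ids[:, :-1])
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), ids[:, 1:].flatten())
        opt.zero_grad(); loss.backward(); opt.step()  # adapt weights in place
    return model
```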
-
Olmo Hybrid: From Theory to Practice and Back
A 7B hybrid attention-recurrent model outperforms its pure-transformer counterpart on pretraining metrics and scales more efficiently, supported by a proof that hybrids are strictly more expressive than either transformers or linear RNNs.
-
Beyond Similarity: Temporal Operator Attention for Time Series Analysis
Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens beyond exact-prefix matching and saves 63% of prefill energy per cache hit on MLA-MoE models, by content-hashing content-defined (CDC) chunks and applying a closed-form k_r correction.
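For background, content-defined chunking places chunk boundaries where a rolling hash of the content hits a mask, so identical spans chunk identically regardless of position. A generic gear-hash sketch of the technique (not Irminsul's implementation, which would operate over token IDs):

```python
import hashlib

# Gear table and mask choice follow common CDC practice.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]
MASK = (1 << 13) - 1  # ~8 KiB average chunks

def cdc_chunks(data: bytes, min_size=2048, max_size=65536):
    """Split data at content-defined boundaries via a gear rolling hash."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if (i - start + 1 >= min_size and (h & MASK) == 0) \
                or i - start + 1 >= max_size:
            chunks.append(data[start:i + 1])   # hash of this chunk keys the cache
            start, h = i + 1, 0
    return chunks + ([data[start:]] if start < len(data) else [])
```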
-
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
-
FG²-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
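The change is small enough to state directly: in the gated delta-rule update, the scalar write strength β_t becomes a per-channel vector, giving each value channel its own learning rate. A minimal NumPy sketch of one step, based on our reading of the one-line summary (the decoupled key/value scaling is not shown):

```python
import numpy as np

def fg2_gdn_step(S, k, v, alpha, beta_vec):
    """One gated delta-rule step with a channel-wise beta (sketch).

    S: (d_v, d_k) fast-weight state; k: (d_k,) key; v: (d_v,) value;
    alpha: scalar forget gate; beta_vec: (d_v,) per-channel write strengths
    (a single scalar beta in plain GDN).
    """
    S_decayed = alpha * S                   # gated forgetting
    v_old = S_decayed @ k                   # current read at key k
    delta = beta_vec * (v - v_old)          # per-channel delta correction
    return S_decayed + np.outer(delta, k)
```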
-
On The Application of Linear Attention in Multimodal Transformers
Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention, measured on ViT models trained on LAION-400M and evaluated with zero-shot ImageNet-21K validation.