Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim · 2024 · DOI 10.52202/079017-3668

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

Efficiently Representing Algorithms With Chain-of-Thought Transformers

cs.LG · 2026-06-18 · conditional · novelty 8.0

CoT transformers simulate any Word RAM algorithm with poly-logarithmic overhead in three architectures, improving on the quadratic overhead of Turing machine (TM) simulation.

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

cs.CL · 2026-06-04 · unverdicted · novelty 5.0

YouZhi-LLM applies a layer-adaptive GQA-to-MLA transition plus Ascend-specific distillation and fine-tuning to reduce KV-cache size, yielding up to 2.69× higher concurrency and modest gains on financial benchmarks versus base models.

On Subquadratic Architectures: From Applications to Principles

cs.LG · 2026-06-10 · unverdicted · novelty 4.0

xLSTM outperforms Mamba-2 and Gated DeltaNet on tasks with complex dependencies because its gating scheme enables more flexible and stable state tracking and memory accumulation.

citing papers explorer

Showing 5 of 5 citing papers.

Efficiently Representing Algorithms With Chain-of-Thought Transformers · cs.LG · 2026-06-18 · conditional · none · ref 39
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention · cs.LG · 2026-05-13 · unverdicted · none · ref 65
Priming: Hybrid State Space Models From Pre-trained Transformers · cs.LG · 2026-05-08 · unverdicted · none · ref 95
YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition · cs.CL · 2026-06-04 · unverdicted · none · ref 40
On Subquadratic Architectures: From Applications to Principles · cs.LG · 2026-06-10 · unverdicted · none · ref 15
