You only cache once: Decoder-decoder architectures for language models

Sun, Y · 2024 · arXiv 2405.05254

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.

Block-Based Double Decoders

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Block-based double decoders use doubly-causal block attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, outperforming standard encoder-decoders in scaling experiments.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

cs.CL · 2024-10-17 · unverdicted · novelty 6.0

LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.

Q-Delta: Beyond Key-Value Associative State Evolution

cs.AI · 2026-06-07 · unverdicted · novelty 5.0

Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.

Gated Delta Networks: Improving Mamba2 with Delta Rule

cs.CL · 2024-12-09 · unverdicted · novelty 5.0

Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.

citing papers explorer

Showing 8 of 8 citing papers.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 58
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
You Only Index Once: Cross-Layer Sparse Attention with Shared Routing cs.CL · 2026-06-04 · unverdicted · none · ref 27
CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.
Do Value Vectors in Deep Layers Need Context from the Residual Stream? cs.CL · 2026-06-01 · unverdicted · none · ref 103
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
Block-Based Double Decoders cs.LG · 2026-05-11 · unverdicted · none · ref 9 · 2 links
Block-based double decoders use doubly-causal block attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, outperforming standard encoder-decoders in scaling experiments.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 96
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation cs.CL · 2024-10-17 · unverdicted · none · ref 30
LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.
Q-Delta: Beyond Key-Value Associative State Evolution cs.AI · 2026-06-07 · unverdicted · none · ref 82
Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.
Gated Delta Networks: Improving Mamba2 with Delta Rule cs.CL · 2024-12-09 · unverdicted · none · ref 167
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.

You only cache once: Decoder-decoder architectures for language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer