You only cache once: Decoder-decoder architectures for language models
8 Pith papers cite this work. Polarity classification is still indexing.
citation summary
verdicts: UNVERDICTED 8
roles: background 1
polarities: background 1
citing papers explorer
- Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
  KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
- You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
  CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.
- Do Value Vectors in Deep Layers Need Context from the Residual Stream?
  Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
- Block-Based Double Decoders
  Block-based double decoders use doubly-causal block attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, outperforming standard encoder-decoders in scaling experiments.
- Kimi Linear: An Expressive, Efficient Attention Architecture
  Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
- LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation
  LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.
- Q-Delta: Beyond Key-Value Associative State Evolution
  Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.
- Gated Delta Networks: Improving Mamba2 with Delta Rule
  Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.