DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

· 2025 · cs.CL · arXiv 2510.09883

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. One approach to reduce this latency is to evict entries from the key-value (KV) cache, thereby reducing the active context used in attention computation. However, such sparse attention methods suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the evolving importance of tokens over long derivations. We present \textbf{DELTA}, a training-free sparse attention mechanism that improves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of \emph{$\Delta$-layers} that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to $4.25\times$ and delivering $1.54\times$ end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning. The code is available at https://github.com/hoenza/DELTA.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

citing papers explorer

Showing 2 of 2 citing papers.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 139 · internal anchor
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
You Only Index Once: Cross-Layer Sparse Attention with Shared Routing cs.CL · 2026-06-04 · unverdicted · none · ref 8 · internal anchor
CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer