Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
Attention sinks in diffusion language models
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7verdicts
UNVERDICTED 7representative citing papers
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
Suppressing attention sinks in Stable Diffusion 3 does not degrade text-image alignment or preference metrics at mild intervention levels, though stronger suppression reveals sink-specific perceptual shifts larger than random masking.
MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.
citing papers explorer
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.