How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse

Yichuan Deng, Zhao Song, Jing Xiong, Chiwun Yang · 2025 · arXiv 2404.02690

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

cs.LG · 2025-05-05 · conditional · novelty 6.0

RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

cs.LG · 2024-09-16 · conditional · novelty 6.0

RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.

citing papers explorer

Showing 6 of 6 citing papers.

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing cs.CV · 2026-05-17 · unverdicted · none · ref 13
HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction cs.LG · 2026-05-10 · unverdicted · none · ref 8
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving cs.LG · 2026-04-29 · unverdicted · none · ref 18
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 8
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference cs.LG · 2025-05-05 · conditional · none · ref 23
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval cs.LG · 2024-09-16 · conditional · none · ref 61
RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.

How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer