LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. arXiv preprint arXiv:2503.08879.
4 Pith papers cite this work, all from 2026; polarity classification of the citations is still indexing, so all four are currently unverdicted.
Representative citing papers:
- Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
  GVR uses previous-step Top-K predictions, pre-indexed statistics, secant counting, and shared-memory verification to deliver a 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads. (A minimal sketch of the guess-verify-refine loop follows this list.)
- PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training
  Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
  A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution. (A sketch of this eviction pattern also follows the list.)
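
As a rough illustration of the guess-verify-refine loop named in the first entry above, here is a minimal NumPy sketch; it is not the paper's kernel. The function name, the refinement budget, and the secant-on-counts formulation are assumptions, and the GPU-level pieces the summary mentions (pre-indexed statistics, shared-memory verification on Blackwell) are not modeled. The previous step's Top-K seeds a threshold, a survivor count verifies it, secant updates refine it, and an exact fallback keeps the result bit-exact.

```python
import numpy as np

def gvr_topk(scores: np.ndarray, prev_topk: np.ndarray, k: int,
             max_refine: int = 4) -> np.ndarray:
    """Return the indices of the k largest scores, exactly.

    Guess: seed a selection threshold from the previous step's Top-K
    (temporal correlation). Verify: count how many scores clear it.
    Refine: secant updates to the threshold; fall back to a full exact
    select if the refinement budget runs out (e.g., on ties).
    """
    # Guess: last step's k-th largest score approximates this step's.
    t = float(scores[prev_topk].min())
    # Second secant point: a threshold low enough to pass every score.
    t_lo, c_lo = float(scores.min()) - 1.0, scores.size
    for _ in range(max_refine):
        c = int((scores >= t).sum())          # verify: survivor count
        if c == k:
            # Exactly k survivors: they are precisely the k largest.
            return np.flatnonzero(scores >= t)
        if c == c_lo:                         # flat segment: secant undefined
            break
        # Refine: secant step on f(t) = count(scores >= t) - k.
        t, t_lo, c_lo = t - (c - k) * (t - t_lo) / (c - c_lo), t, c
    # Fallback: exact selection, so the result is always bit-exact.
    return np.argpartition(scores, -k)[-k:]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.standard_normal(4096)
    # Simulate temporal correlation: last step's scores were similar.
    prev_scores = scores + 0.01 * rng.standard_normal(4096)
    prev_topk = np.argpartition(prev_scores, -64)[-64:]
    idx = gvr_topk(scores, prev_topk, k=64)
    assert set(idx) == set(np.argpartition(scores, -64)[-64:])
```

The fallback is what preserves exactness: the threshold path only returns when exactly k survivors remain, which are provably the k largest scores.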
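
The eviction entry describes a learnable, cross-layer-calibrated importance score, but the scorer itself is not given here. This sketch substitutes accumulated attention mass for the learned per-token score and a per-layer z-normalization for the paper's cross-layer calibration; `evict_kv` and `budget` are placeholder names.

```python
import numpy as np

def evict_kv(attn: list[np.ndarray], budget: int) -> np.ndarray:
    """Choose which cached tokens to keep under a fixed budget.

    attn: one array per layer, shape (heads, q_len, kv_len), e.g. the
    softmax attention weights from recent decode steps.
    """
    kv_len = attn[0].shape[-1]
    score = np.zeros(kv_len)
    for a in attn:
        s = a.sum(axis=(0, 1))                      # attention mass per token
        score += (s - s.mean()) / (s.std() + 1e-6)  # per-layer calibration
    # Retain the highest-importance tokens; evict the rest.
    return np.sort(np.argsort(score)[-budget:])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers, heads, q_len, kv_len = 4, 8, 1, 1024
    attn = [rng.random((heads, q_len, kv_len)) for _ in range(layers)]
    keep = evict_kv(attn, budget=256)   # indices of tokens to retain
    assert keep.size == 256
```

Normalizing each layer's scores before summing keeps a single high-magnitude layer from dominating the unified ranking, which is the intuition behind calibrating across layers.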