hub Canonical reference

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu · 2024 · cs.CL · arXiv 2406.02069

Canonical reference. 75% of citing Pith papers cite this work as background.

80 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 80 citing papers arXiv PDF

abstract

In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100.0 Acc. performance.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 method 2 baseline 1

citation-polarity summary

background 12 use method 2 baseline 1 unclear 1

representative citing papers

Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LOCOS scores attention heads via OV-circuit output projection onto answer-token unembedding directions and identifies non-literal retrieval heads whose ablation collapses performance on non-literal benchmarks more than prior literal-copy detectors.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

AsymVLM introduces asymmetric token pruning for vision and text in VLMs to deliver up to 54% FLOPs reduction while matching or exceeding prior methods on localized visual tasks.

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

cs.IT · 2026-05-24 · unverdicted · novelty 7.0

Under a polynomial context-truncation sensitivity assumption, suffix-only KV cache policies require per-token memory scaling as Θ(ε^{-1/α}) to achieve distortion ε.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

cs.CV · 2026-05-20 · conditional · novelty 7.0

HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

FibQuant is a universal fixed-rate vector quantizer for KV-cache compression that uses a radial-angular codebook matched to the spherical-Beta source after Haar rotation and strictly outperforms scalar quantization at matched rates.

How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adaptive-oblivious error separation.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

Transactional Attention: Semantic Sponsorship for KV-Cache Retention

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

cs.DC · 2026-03-05 · unverdicted · novelty 7.0

Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

cs.LG · 2025-02-03 · unverdicted · novelty 7.0

FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

VaSE improves KV cache eviction accuracy for reasoning models by over 4% versus prior eviction methods at 4x compression through value-magnitude protection and stochastic diversity.

TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

TGV-KV uses text-vision budgeting, weighted ranking, and prioritised retention to evict KV cache in VLMs while retaining 99.2% accuracy at 5% budget on VizWiz-VQA.

Scaling LLM Inference Beyond Amdahl`s Limits via Eliminating Non-Scalable Overheads

cs.DC · 2026-06-01 · unverdicted · novelty 6.0

Albireo overlaps non-scalable overheads with compute in tensor-parallel LLM inference to raise the empirical optimal TP degree, delivering up to 1.9x throughput and 48% lower latency versus vLLM.

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

STaR-KV is a training-free KV cache compression framework for GUI VLMs that uses subspace-aware scoring, temporal stability discounts, and entropy-based temperature adaptation to outperform prior methods at matched budgets while reducing peak memory by ~40% at 20% cache size.

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

MomentKV maintains count, key mean, value mean, and value-key covariance over evicted tokens to guide selective eviction and provide a first-order approximation of their attention contribution, outperforming baselines on LongBench and RULER.

MURMUR: An Efficient Inference System for Long-Form ASR

cs.LG · 2026-05-31 · conditional · novelty 6.0

Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

citing papers explorer

Showing 50 of 80 citing papers.

Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads cs.CL · 2026-07-01 · unverdicted · none · ref 8 · internal anchor
LOCOS scores attention heads via OV-circuit output projection onto answer-token unembedding directions and identifies non-literal retrieval heads whose ablation collapses performance on non-literal benchmarks more than prior literal-copy detectors.
Self-GC: Self-Governing Context for Long-Horizon LLM Agents cs.AI · 2026-07-01 · unverdicted · none · ref 34 · internal anchor
Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving cs.AI · 2026-06-04 · unverdicted · none · ref 6 · internal anchor
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference cs.LG · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
AsymVLM introduces asymmetric token pruning for vision and text in VLMs to deliver up to 54% FLOPs reduction while matching or exceeding prior methods on localized visual tasks.
Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression cs.IT · 2026-05-24 · unverdicted · none · ref 9 · internal anchor
Under a polynomial context-truncation sensitivity assumption, suffix-only KV cache policies require per-token memory scaling as Θ(ε^{-1/α}) to achieve distortion ε.
Layer-wise Token Compression for Efficient Document Reranking cs.IR · 2026-05-20 · unverdicted · none · ref 4 · 2 links · internal anchor
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs
Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation cs.CV · 2026-05-20 · conditional · none · ref 4 · internal anchor
HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.
FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression cs.AI · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
FibQuant is a universal fixed-rate vector quantizer for KV-cache compression that uses a radial-angular codebook matched to the spherical-Beta source after Haar rotation and strictly outperforms scalar quantization at matched rates.
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers cs.LG · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adaptive-oblivious error separation.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving cs.LG · 2026-04-17 · unverdicted · none · ref 4 · internal anchor
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
Transactional Attention: Semantic Sponsorship for KV-Cache Retention cs.CL · 2026-04-13 · unverdicted · none · ref 7 · internal anchor
Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 148 · internal anchor
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression cs.CL · 2026-04-06 · unverdicted · none · ref 2 · internal anchor
TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs cs.DC · 2026-03-05 · unverdicted · none · ref 1 · internal anchor
Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 27 · internal anchor
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration cs.LG · 2025-02-03 · unverdicted · none · ref 8 · internal anchor
FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.
Value-Aware Stochastic KV Cache Eviction for Reasoning Models cs.LG · 2026-06-02 · unverdicted · none · ref 40 · internal anchor
VaSE improves KV cache eviction accuracy for reasoning models by over 4% versus prior eviction methods at 4x compression through value-magnitude protection and stochastic diversity.
TGV-KV: Text-Grounded KV Eviction for Vision-Language Models cs.CV · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
TGV-KV uses text-vision budgeting, weighted ranking, and prioritised retention to evict KV cache in VLMs while retaining 99.2% accuracy at 5% budget on VizWiz-VQA.
Scaling LLM Inference Beyond Amdahl`s Limits via Eliminating Non-Scalable Overheads cs.DC · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
Albireo overlaps non-scalable overheads with compute in tensor-parallel LLM inference to raise the empirical optimal TP degree, delivering up to 1.9x throughput and 48% lower latency versus vLLM.
STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models cs.CV · 2026-06-01 · unverdicted · none · ref 22 · internal anchor
STaR-KV is a training-free KV cache compression framework for GUI VLMs that uses subspace-aware scoring, temporal stability discounts, and entropy-based temperature adaptation to outperform prior methods at matched budgets while reducing peak memory by ~40% at 20% cache size.
MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference cs.LG · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
MomentKV maintains count, key mean, value mean, and value-key covariance over evicted tokens to guide selective eviction and provide a first-order approximation of their attention contribution, outperforming baselines on LongBench and RULER.
MURMUR: An Efficient Inference System for Long-Form ASR cs.LG · 2026-05-31 · conditional · none · ref 28 · internal anchor
Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.
Adaptive Mass-Segmented KV Compression for Long-Context Reasoning cs.LG · 2026-05-22 · unverdicted · none · ref 14 · internal anchor
AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond cs.LG · 2026-05-19 · unverdicted · none · ref 8 · internal anchor
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees cs.AI · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL · 2026-05-18 · unverdicted · none · ref 35 · internal anchor
DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs cs.LG · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization cs.LG · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference cs.AR · 2026-05-17 · unverdicted · none · ref 6 · internal anchor
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction cs.CV · 2026-05-15 · unverdicted · none · ref 3 · 2 links · internal anchor
GHOST is a geometry-hierarchical token eviction framework that halves the KV cache size in monocular video 3D reconstruction while maintaining quality and achieving 1.75x faster inference.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 72 · internal anchor
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference cs.LG · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
Spherical KV combines angle-domain attention using spherical key codes with rate-distortion retention to cut KV cache residency and HBM traffic while keeping a paged, fusion-friendly decode path.
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation cs.CV · 2026-05-13 · unverdicted · none · ref 30 · internal anchor
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference cs.LG · 2026-05-12 · conditional · none · ref 14 · internal anchor
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction cs.LG · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning cs.CL · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
A semantics-aware KV cache hierarchy offloads tokens to slower memory with zero approximation error, demonstrating that LLM reasoning accuracy depends only on the permanent eviction ratio and not on HBM residency.
ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference cs.LG · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
ProxyKV offloads KV cache importance scoring to a lightweight intra-family small-model proxy with HybridAxialMapper and ranking-focused loss, matching KVZip accuracy while achieving up to 3.21x prefilling speedup on models up to 32B.
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing cs.CL · 2026-05-09 · conditional · none · ref 3 · internal anchor
ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61x at 128k context.
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents cs.MA · 2026-05-09 · unverdicted · none · ref 64 · internal anchor
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache cs.LG · 2026-05-08 · unverdicted · none · ref 27 · internal anchor
RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache retention on LongBench.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference cs.CL · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache cs.LG · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility cs.AI · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at 128K context.
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding cs.AR · 2026-04-27 · unverdicted · none · ref 5 · internal anchor
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing cs.CL · 2026-04-21 · unverdicted · none · ref 15 · internal anchor
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
Graph-Guided Adaptive Channel Elimination for KV Cache Compression eess.SP · 2026-04-18 · unverdicted · none · ref 14 · internal anchor
GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction cs.LG · 2026-04-14 · unverdicted · none · ref 14 · internal anchor
RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster decoding.
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference cs.DC · 2026-04-07 · unverdicted · none · ref 6 · internal anchor
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization cs.LG · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.
LightThinker++: From Reasoning Compression to Memory Management cs.CL · 2026-04-04 · unverdicted · none · ref 64 · internal anchor
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer