
arxiv: 2402.02750 · v2 · submitted 2024-02-05 · 💻 cs.CL · cs.LG · cs.PF

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Pith reviewed 2026-05-12 08:47 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.PF
keywords KV cache · 2-bit quantization · LLM inference · memory efficiency · asymmetric quantization · key-value cache · batch size scaling

The pith

KIVI uses per-channel 2-bit quantization for keys and per-token 2-bit quantization for values to cut peak memory, model weights included, by 2.6 times while keeping quality almost unchanged on Llama, Falcon, and Mistral.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first maps the element distributions inside the key and value caches of several large language models. This mapping shows that keys have channel-wise patterns best captured by grouping along the channel axis, while values have token-wise patterns best captured by grouping along the token axis. The authors then build a 2-bit quantization routine called KIVI that follows these groupings and requires no retraining or per-model search. If the approach works as claimed, the reduced memory footprint lets the same hardware handle up to four times as many simultaneous requests and raises throughput by 2.35 to 3.47 times on real inference workloads.

Core claim

KIVI is a tuning-free 2-bit quantization algorithm that quantizes the key cache per-channel and the value cache per-token. On Llama, Falcon, and Mistral models it preserves output quality while reducing peak memory usage by 2.6 times, which in turn supports up to 4 times larger batch sizes and delivers 2.35 to 3.47 times higher throughput on real inference workloads.

What carries the argument

Asymmetric 2-bit quantization that groups keys along the channel dimension and values along the token dimension.
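The grouping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names `quantize_asym` and `dequantize`, the tensor shapes, and the random inputs are all assumptions, and the sketch ignores any sub-grouping or kernel-level details the released code may use.

```python
import numpy as np

def quantize_asym(x, bits=2, axis=-1):
    """Round-to-nearest asymmetric quantization.

    One (scale, zero-point) pair is computed per slice, with the
    statistics (min and max) reduced over `axis`.
    """
    levels = 2 ** bits - 1
    zero = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - zero) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard constant slices
    codes = np.clip(np.round((x - zero) / scale), 0, levels).astype(np.uint8)
    return codes, scale, zero

def dequantize(codes, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + zero

# Hypothetical cache slices laid out as (tokens, channels).
rng = np.random.default_rng(0)
tokens, channels = 64, 16
keys = rng.standard_normal((tokens, channels)).astype(np.float32)
values = rng.standard_normal((tokens, channels)).astype(np.float32)

# Keys, per-channel: each channel gets its own scale and zero-point,
# so min/max are reduced over the token axis (axis=0).
k_codes, k_scale, k_zero = quantize_asym(keys, bits=2, axis=0)

# Values, per-token: each token gets its own scale and zero-point,
# so min/max are reduced over the channel axis (axis=1).
v_codes, v_scale, v_zero = quantize_asym(values, bits=2, axis=1)

print(k_scale.shape, v_scale.shape)
```

The only difference between the two calls is the axis over which the range is measured, which is what makes the scheme asymmetric between keys and values.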

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory reduction could make it practical to serve the same models on smaller GPUs or to extend context lengths without adding hardware.
  • Because the method is hardware-friendly and tuning-free, it could be combined with weight quantization to lower total system memory further.
  • Production inference engines that already batch requests would see direct cost savings from the larger feasible batch sizes.

Load-bearing premise

The element distributions measured on the studied LLMs carry over to the models and workloads KIVI is applied to, so the per-channel key and per-token value grouping remains optimal with no extra tuning.

What would settle it

Measuring clear quality loss when KIVI is applied to a new large language model or to context lengths substantially longer than those used in the paper's experiments.

Original abstract

Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, the loading of the KV cache causes the computational core to be idle, which limits the inference speed. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm named KIVI. With hardware-friendly implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weight). This reduction in memory usage enables up to $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces KIVI, a tuning-free asymmetric 2-bit quantization scheme for LLM KV caches. An empirical study of element distributions across popular models leads to the design choice of per-channel quantization for keys and per-token quantization for values. The authors claim that this approach preserves model quality on Llama, Falcon, and Mistral families while reducing peak memory (including weights) by 2.6×, enabling up to 4× larger batch sizes and 2.35–3.47× higher throughput under real inference workloads, with hardware-friendly implementation and open-source code.

Significance. If the reported quality preservation and throughput gains hold under broader conditions, KIVI would address a key memory bottleneck in batched LLM serving without requiring retraining or per-model tuning. The combination of a distribution-driven design, concrete speedups on multiple model families, and public implementation would make the result practically relevant for efficient inference deployments.

major comments (3)
  1. [distribution study and method sections] The central claim of near-identical quality at 2.6× memory reduction rests on the fixed per-channel (keys) / per-token (values) grouping chosen after the element-distribution study. No ablation is presented that quantifies the quantization error or downstream quality degradation when this grouping is altered or when applied to models/contexts whose statistics deviate from the study set (e.g., longer contexts or different fine-tunes). Because the method is explicitly tuning-free, this generalization assumption is load-bearing.
  2. [Experiments] Experiments section: quality metrics are reported as “almost the same” without error bars, multiple random seeds, or statistical significance tests. In the absence of these, it is difficult to determine whether observed differences fall within normal run-to-run variation, undermining the robustness of the headline quality-preservation claim.
  3. [Experiments] The throughput and batch-size scaling results (2.35–3.47×) are measured on specific hardware and workloads. The paper does not report the precise KV-cache memory breakdown or the fraction of total memory occupied by the cache at the tested batch sizes, making it hard to verify that the 2.6× peak-memory reduction directly translates to the claimed 4× batch-size increase.
minor comments (2)
  1. [Method] Notation for the asymmetric quantizer (scale and zero-point definitions) could be made more explicit with a single equation block rather than scattered prose.
  2. [Figures] Figure captions for the distribution histograms should state the exact number of tokens/layers sampled and the models used, to allow readers to assess representativeness.
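For readers who want the single equation block that minor comment 1 asks for, a standard round-to-nearest asymmetric $B$-bit quantizer can be written as below. The symbols $z_X$ (zero-point), $s_X$ (scale), and $X'$ (dequantized tensor) are assumed notation for illustration; the min and max are taken over whichever group of elements shares one scale and zero-point.

```latex
\begin{aligned}
z_X &= \min X, \qquad s_X = \frac{\max X - \min X}{2^{B} - 1},\\
Q(X) &= \left\lfloor \frac{X - z_X}{s_X} \right\rceil,\\
X' &= Q(X)\, s_X + z_X .
\end{aligned}
```

With $B = 2$ each element is stored as one of four codes, and the per-element reconstruction error is at most $s_X / 2$.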

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical support and transparency of the work without altering its core claims.

Point-by-point responses
  1. Referee: [distribution study and method sections] The central claim of near-identical quality at 2.6× memory reduction rests on the fixed per-channel (keys) / per-token (values) grouping chosen after the element-distribution study. No ablation is presented that quantifies the quantization error or downstream quality degradation when this grouping is altered or when applied to models/contexts whose statistics deviate from the study set (e.g., longer contexts or different fine-tunes). Because the method is explicitly tuning-free, this generalization assumption is load-bearing.

    Authors: We appreciate the referee's emphasis on this point. Section 3 presents a detailed empirical study of KV cache element distributions across Llama, Falcon, and Mistral families, which consistently motivates per-channel quantization for keys and per-token for values. While the paper does not include exhaustive ablations on every alternative grouping, the reported results already cover three distinct model families. To directly address the generalization concern, the revised manuscript will include a targeted ablation comparing the chosen grouping against per-token keys and per-channel values, plus additional results on contexts up to 8k tokens. We maintain that the tuning-free design is a deliberate strength, but agree that explicit robustness checks improve the presentation. revision: partial

  2. Referee: [Experiments] Experiments section: quality metrics are reported as “almost the same” without error bars, multiple random seeds, or statistical significance tests. In the absence of these, it is difficult to determine whether observed differences fall within normal run-to-run variation, undermining the robustness of the headline quality-preservation claim.

    Authors: Thank you for this observation. The primary quality metrics (perplexity on WikiText and zero-shot accuracies) are deterministic given fixed model weights, prompts, and greedy decoding; no stochastic sampling is involved in the reported numbers. Nevertheless, to enhance statistical rigor, we will rerun the relevant experiments across three random seeds (where any data-ordering randomness exists), report standard deviations, and add error bars to the tables in the revised version. We will also clarify that observed differences remain below 0.5% and fall within the negligible range for practical deployment. revision: yes

  3. Referee: [Experiments] The throughput and batch-size scaling results (2.35–3.47×) are measured on specific hardware and workloads. The paper does not report the precise KV-cache memory breakdown or the fraction of total memory occupied by the cache at the tested batch sizes, making it hard to verify that the 2.6× peak-memory reduction directly translates to the claimed 4× batch-size increase.

    Authors: We agree that a finer-grained memory breakdown would improve verifiability. In the revised manuscript we will add a dedicated table (and accompanying text) that decomposes peak memory into model weights, KV cache, activations, and other overheads at the exact batch sizes used for the throughput measurements on A100 GPUs. This will explicitly quantify the KV-cache fraction before and after KIVI and show how the 2.6× reduction enables the observed 4× batch-size scaling and throughput gains. revision: yes
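The kind of accounting the referee asks for can be sketched with back-of-the-envelope arithmetic. Everything numeric below is a hypothetical assumption chosen for illustration (a 32-layer model with hidden size 4096, a 40 GiB budget, 13 GiB of weights, 4096-token sequences, one 16-bit scale and zero-point per group of 32 elements); none of it is taken from the paper, and activations and other overheads are ignored.

```python
GIB = 2 ** 30

def kv_bytes_per_token(layers, hidden, bits, group_size=None, meta_bits=16):
    """Bytes of KV cache stored per token, keys plus values, all layers.

    If `group_size` is given, each group of that many elements also stores
    one scale and one zero-point of `meta_bits` bits each.
    """
    elems = 2 * layers * hidden  # keys and values
    payload = elems * bits / 8
    overhead = 0.0
    if group_size is not None:
        overhead = (elems / group_size) * 2 * meta_bits / 8
    return payload + overhead

def max_batch(budget_bytes, weight_bytes, per_token_bytes, seq_len):
    """Largest batch whose KV cache fits in the memory left after weights."""
    return int((budget_bytes - weight_bytes) // (per_token_bytes * seq_len))

# Hypothetical model shape and memory budget (assumptions, not paper data).
layers, hidden, seq_len = 32, 4096, 4096
budget, weights = 40 * GIB, 13 * GIB

fp16 = kv_bytes_per_token(layers, hidden, bits=16)
int2 = kv_bytes_per_token(layers, hidden, bits=2, group_size=32)

print("bytes per token:", fp16, int2)
print("max batch:", max_batch(budget, weights, fp16, seq_len),
      max_batch(budget, weights, int2, seq_len))
```

Under these assumptions the feasible batch size depends on how much of the budget the cache occupies, not only on the cache compression ratio, which is why a per-component memory table is needed to connect the peak-memory figure to the batch-size figure.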

Circularity Check

0 steps flagged

No circularity: empirical distribution study directly informs fixed grouping without tautological reduction

Full rationale

The paper's derivation begins with an explicit empirical study of KV cache element distributions in popular LLMs, from which the per-channel (keys) and per-token (values) grouping is selected as an observation-driven design choice. KIVI is then constructed as a tuning-free 2-bit asymmetric quantizer using this fixed grouping. All headline performance claims (2.6× peak memory reduction, 4× batch size, 2.35–3.47× throughput) are presented as measured outcomes on Llama/Falcon/Mistral models rather than quantities derived by construction from the grouping parameters themselves. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided derivation chain; the method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on an empirical distribution study whose details are not expanded in the abstract; no free parameters are introduced because the method is tuning-free, and no new entities are postulated.

axioms (1)
  • domain assumption Quantization error from 2-bit asymmetric rounding remains tolerable when grouping is chosen according to observed per-channel and per-token statistics.
    Invoked to justify why the chosen grouping preserves quality without tuning.
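A toy experiment shows why the grouping axis matters for this assumption. The data are synthetic: two channels are given a large constant offset as a stand-in for channel-wise patterns, which is an assumption made for illustration and not a measurement from any model. The helper `fake_quant` is likewise hypothetical.

```python
import numpy as np

def fake_quant(x, axis, bits=2):
    """Quantize then dequantize, with min/max statistics reduced over `axis`."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard constant slices
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
tokens, channels = 128, 32
x = rng.standard_normal((tokens, channels))
x[:, :2] += 20.0  # assumption: two channels sit far from the rest

# Per-channel grouping: one range per channel (reduce over tokens).
err_per_channel = np.mean((x - fake_quant(x, axis=0)) ** 2)
# Per-token grouping: one range per token (reduce over channels).
err_per_token = np.mean((x - fake_quant(x, axis=1)) ** 2)

print(err_per_channel, err_per_token)
```

On this synthetic input the per-token range is stretched by the offset channels, so the four available levels are spread thinly and the error grows. Whether real caches behave this way for a given model is exactly what the assumption leaves to empirical checking.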

pith-pipeline@v0.9.0 · 5631 in / 1258 out tokens · 29070 ms · 2026-05-12T08:47:40.743073+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  2. Block-Sphere Vector Quantization

    cs.LG 2026-05 unverdicted novelty 7.0

    BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.

  3. Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

    cs.CV 2026-05 unverdicted novelty 7.0

    RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or chan...

  4. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

    cs.DC 2026-05 conditional novelty 7.0

    KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

  5. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  6. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  7. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  8. Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

    cs.LG 2026-04 unverdicted novelty 7.0

    KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior...

  9. Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

    cs.LG 2026-04 unverdicted novelty 7.0

    High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...

  10. How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...

  11. CodeComp: Structural KV Cache Compression for Agentic Coding

    cs.CL 2026-04 unverdicted novelty 7.0

    CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context pa...

  12. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  13. Fast Cross-Operator Optimization of Attention Dataflow

    cs.AR 2026-04 unverdicted novelty 7.0

    MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

  14. Efficient Remote KV Cache Reuse with GPU-native Video Codec

    cs.DC 2026-02 conditional novelty 7.0

    KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

  15. DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

    cs.CL 2025-10 conditional novelty 7.0

    DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...

  16. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    cs.CL 2024-10 conditional novelty 7.0

    DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

  17. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    cs.LG 2024-07 accept novelty 7.0

    FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

  18. Runtime-Certified Bounded-Error Quantized Attention

    cs.LG 2026-05 unverdicted novelty 6.0

    A tiered KV cache architecture computes per-head per-step error bounds on quantized attention and uses adaptive fallback to guarantee bounded or exact outputs relative to FP16 reference.

  19. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...

  20. OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

    cs.LG 2026-05 unverdicted novelty 6.0

    OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

  21. DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

    cs.CL 2026-05 unverdicted novelty 6.0

    DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

  22. Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.

  23. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

  24. VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

    cs.AR 2026-05 unverdicted novelty 6.0

    VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

  25. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  26. SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.

  27. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  28. RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...

  29. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  30. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  31. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.

  32. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  33. PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

    cs.LG 2026-04 conditional novelty 6.0

    A single shared asymmetrically compressed KV cache pool enables up to 15 concurrent LLM agents with 2.91x compression, 97.7% memory reduction, and only +0.57% perplexity increase on Llama-3-8B.

  34. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  35. SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

    cs.NI 2026-04 unverdicted novelty 6.0

    SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...

  36. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  37. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  38. Quantization Dominates Rank Reduction for KV-Cache Compression

    cs.LG 2026-04 conditional novelty 6.0

    Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...

  39. eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

    cs.LG 2026-04 unverdicted novelty 6.0

    eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.

  40. TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

    cs.CL 2026-03 unverdicted novelty 6.0

    TTKV reduces cross-tier KV cache traffic by 5.94x on 128K-context tasks and cuts latency up to 76% by using temporal tiers, HBM/DRAM separation, and block-wise streaming attention.

  41. EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

    cs.CL 2026-03 unverdicted novelty 6.0

    EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand f...

  42. InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

    cs.LG 2026-02 unverdicted novelty 6.0

    InnerQ delivers 1.3x average speedup over prior KV cache quantization and 2.7x over baseline by inner-dimension grouping, hybrid symmetric/asymmetric quantization, high-precision windows for recent and sink tokens, an...

  43. HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

    cs.CL 2026-01 unverdicted novelty 6.0

    HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x fas...

  44. OjaKV: Context-Aware Online Low-Rank KV Cache Compression

    cs.CL 2025-09 unverdicted novelty 6.0

    OjaKV introduces hybrid full-rank storage for key tokens combined with online low-rank KV cache compression via Oja's algorithm to support memory-efficient long-context LLM inference.

  45. EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

    cs.CL 2025-09 unverdicted novelty 6.0

    EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.

  46. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

    cs.LG 2025-04 unverdicted novelty 6.0

    TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a fac...

  47. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    cs.CL 2024-07 accept novelty 6.0

    Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...

  48. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  49. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

    cs.CL 2023-12 unverdicted novelty 6.0

    ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

  50. How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

    cs.LG 2026-05 unverdicted novelty 5.0

    Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.

  51. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 5.0

    HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.

  52. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  53. LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

  54. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...

  55. SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

    cs.LG 2026-02 conditional novelty 5.0

    SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.

  56. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  57. AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results

    cs.DC 2025-02 unverdicted novelty 5.0

    AIvaluateXR benchmarks 17 LLMs across four XR platforms on performance, speed, memory and battery metrics and proposes a 3D Pareto optimality method to identify optimal on-device model-device pairs.

  58. A Simple Plug-in for Improving Eviction-Based KV Cache Compression

    cs.LG 2026-05 unverdicted novelty 4.0

    VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.

  59. Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

    cs.LG 2026-05 unverdicted novelty 4.0

    Structural protection of boundary tokens in globally capped KV cache eviction recovers 69-90% of full-cache quality at 13% retention and dominates differences among scoring policies.

  60. Hierarchical vs. Flat Iteration in Shared-Weight Transformers

    cs.CL 2026-04 unverdicted novelty 4.0

    Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 60 Pith papers · 12 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245,

  2. [2]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508,

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    Understanding different design choices in training large time series models

    Yu-Neng Chuang, Songchen Li, Jiayi Yuan, Guanchu Wang, Kwei-Herng Lai, Leisheng Yu, Sirui Ding, Chia-Yuan Chang, Qiaoyu Tan, Daochen Zha, et al. Understanding different design choices in training large time series models. arXiv preprint arXiv:2406.14045,

  5. [5]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339,

  6. [6]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,

  7. [7]

    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079,

  8. [8]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825,

  9. [9]

    S3: Increasing GPU utilization during generative inference for higher throughput

    Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei. S3: Increasing GPU utilization during generative inference for higher throughput. arXiv preprint arXiv:2306.06000,

  10. [10]

    SqueezeLLM: Dense-and-Sparse Quantization

    Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629,

  11. [11]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978,

  12. [12]

    Landmark attention: Random-access infinite context length for transformers

    Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300,

  13. [13]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116,

  14. [14]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

  15. [15]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150,

  16. [16]

    Galactica: A Large Language Model for Science

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085,

  17. [17]

    Scan and snap: Understanding training dynamics and token composition in 1-layer transformer

    Yuandong Tian, Yiping Wang, Beidi Chen, and Simon Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. arXiv preprint arXiv:2305.16380,

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikola...

  19. [19]

    Large language models for healthcare data augmentation: An example on patient-trial matching

    Jiayi Yuan, Ruixiang Tang, Xiaoqian Jiang, and Xia Hu. Large language models for healthcare data augmentation: An example on patient-trial matching. In AMIA Annual Symposium Proceedings, volume 2023, page

  20. [20]

    KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches

    Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, and Xia Hu. KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527,

  21. [21]

    H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H$_2$O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048,

  22. [22]

    Appendix A: Detailed Implementations

    A. Detailed Implementations. In this section, we present the algorithm for KIVI as discussed in Section 3.3. Specifically, we provide the pseudocode for KIVI when calculating the attention output in the prefill and decoding phases. Algorithm 1: The KIVI Prefill & Decoding Algorithm parameter:...

  23. [23]

    while including some modern modifications set forward by Arize-ai and the technical report of Gemini 1.5 (Reid et al., 2024). Footnote 2: https://paulgraham.com/articles.html. Table 7: Performance evaluation of KIVI with residual length 128 and 32 on various models across a range of benchmarks in LongBe...