
arxiv: 2606.08382 · v1 · pith:4LWIJYJXnew · submitted 2026-06-07 · 💻 cs.LG · cs.AI

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Pith reviewed 2026-06-27 18:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache compression · low-rank approximation · soft thresholding · adaptive rank selection · LLM inference optimization · mixed precision quantization · attention speedup

The pith

A differentiable soft thresholding mechanism allows adaptive low-rank compression of the KV cache at head and block levels with up to 75% reduction and minimal accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that fixed or heuristic rank choices limit how aggressively the KV cache can be compressed without hurting model accuracy. STAR-KV replaces those choices with a differentiable soft thresholding step that picks the right rank separately for each attention head and block. It pairs this with a hybrid decomposition that treats key and value projections differently according to their sensitivity and adds a low-rank-aware mixed-precision quantization step. If the mechanism works, inference can run with far less memory and higher speed on the same hardware while keeping the model output quality nearly unchanged. Readers would care because the KV cache is the main memory bottleneck that grows with context length in large language models.

Core claim

STAR-KV achieves adaptive low-rank KV cache compression through a differentiable thresholding mechanism that selects optimal ranks at both attention-head and block levels, a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and a low-rank-aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, this delivers up to 75% KV cache compression and up to 20x overall reduction when combined with quantization, along with up to 6.9x speedup for the attention module and 3.1x end-to-end generation throughput.

What carries the argument

The differentiable thresholding mechanism that enables optimal rank selection at attention-head and block levels.
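As a rough illustration of the mechanism named above, the sketch below applies the textbook soft-thresholding operator to a list of singular values and counts the survivors as the effective rank. The operator form max(σ − α, 0), the function names, and the example spectrum are assumptions made for illustration; the paper's exact operator and its learned thresholds are not given on this page.

```python
# Sketch only: standard soft thresholding applied to a list of singular values.
# The operator form max(sigma - alpha, 0) is an assumption, not taken from the paper.

def soft_threshold(sigmas, alpha):
    """Shrink each singular value by alpha and clamp the result at zero."""
    return [max(s - alpha, 0.0) for s in sigmas]

def effective_rank(sigmas, alpha):
    """Count the singular values that survive the threshold."""
    return sum(1 for s in soft_threshold(sigmas, alpha) if s > 0.0)

def grad_mask(sigmas, alpha):
    """Subgradient of the output w.r.t. each sigma: 1 where active, 0 where zeroed."""
    return [1.0 if s > alpha else 0.0 for s in sigmas]

# Hypothetical spectrum for one attention head (illustrative values only).
sigmas = [4.0, 2.5, 1.0, 0.5, 0.25]
alpha = 0.75

shrunk = soft_threshold(sigmas, alpha)  # [3.25, 1.75, 0.25, 0.0, 0.0]
rank = effective_rank(sigmas, alpha)    # 3
mask = grad_mask(sigmas, alpha)         # [1.0, 1.0, 1.0, 0.0, 0.0]
```

Under this reading, giving each head or block its own `alpha` would yield a different effective rank per head or block, and only the surviving components would receive gradient.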

If this is right

  • KV cache memory footprint drops by up to 75 percent while accuracy stays close to the original model.
  • Combining the compression with quantization yields up to 20 times overall KV cache size reduction.
  • Attention module execution speeds up by as much as 6.9 times and end-to-end generation throughput by 3.1 times.
  • The same framework applies across different LLMs and benchmarks without per-model retuning.
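The headline numbers in the list above can be related with simple arithmetic. The sketch below is a back-of-the-envelope check, not a result from the paper: it assumes the 20x figure is the product of the low-rank factor at 75% compression and a quantization factor, that the uncompressed cache is stored in 16 bits, and that cache size scales linearly with rank. The function name is hypothetical.

```python
from fractions import Fraction

def reduction_factor(compression_pct):
    """Size reduction factor implied by a compression percentage (75% -> 4x)."""
    return 1 / (1 - Fraction(compression_pct, 100))

low_rank_factor = reduction_factor(75)           # 4x from low-rank compression alone
overall_factor = Fraction(20)                    # "up to 20 times" from the list above
quant_factor = overall_factor / low_rank_factor  # 5x left for quantization to supply

# Assumption: the uncompressed cache uses 16-bit values.
baseline_bits = 16
avg_bits = Fraction(baseline_bits) / quant_factor  # 16/5 = 3.2 bits per stored value

# Figure 10's caption gives a maximum rank of 128 per head.
max_head_rank = 128
avg_rank_at_75 = max_head_rank / low_rank_factor   # 32 if the budget were spread evenly
```

Read this way, 75% compression corresponds to an average per-head rank of about 32 out of 128, and the remaining 5x would require roughly 3.2 bits per cached value on average, which is in line with the low-bit mixed-precision scheme the summary describes.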

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The per-head and per-block rank decisions could be applied to compress other transformer activations beyond the KV cache.
  • The approach might allow longer context windows on hardware that previously hit memory limits.
  • Further pairing with speculative decoding or other inference accelerations could produce multiplicative speed gains.

Load-bearing premise

The differentiable thresholding mechanism enables optimal rank selection at attention-head and block levels while maintaining minimal accuracy degradation, without requiring model-specific retuning or introducing training instability.

What would settle it

The central claim would fail if, on a new LLM and a standard benchmark with no retuning, accuracy dropped by more than the reported minimal degradation or compression fell short of 75%.

Figures

Figures reproduced from arXiv: 2606.08382 by Ashkan Moradifirouzabadi, Jungwook Choi, Mingu Kang, Priyansh Bhatnagar, Se-Hyun Yang, Seungjae Lee.

Figure 1. The average zero-shot accuracy of LongChat-v1.5-7B on five tasks (ARC-e, ARC-c, OBQA, PIQA, Hella) across different KV cache compression rates applied by SoTA methods. …ize the rank selection of each low-rank KV representation using a differentiable threshold applied to the singular values. By doing so, STAR-KV yields rank profiles that are learned, in contrast to prior approaches that set ranks via heuris…
Figure 2. Comparison of rank selection strategies at varying KV cache compression rates. Average zero-shot accuracy is reported for LongChat-v1.5-7B evaluated on common sense tasks. … prior to computing attention scores. The position-dependent RoPE matrix is inserted between W_Q and U_K, preventing offline fusion (Li et al., 2025; Ji et al., 2025). In contrast, value processing does not involve positional encoding and t…
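The excerpt above says the RoPE matrix sits between the query projection and the key reconstruction factor. A short derivation makes the consequence visible. The notation is assumed for illustration: x_i is the hidden state at position i, c_j the cached low-rank key latent at position j, R_i the RoPE rotation for position i, and the column-vector convention is a choice made here, not taken from the paper.

```latex
% Assumed notation; only W_Q and U_K appear in the excerpt itself.
\begin{aligned}
q_i &= R_i W_Q^{\top} x_i, \qquad k_j = R_j U_K c_j, \\
q_i^{\top} k_j &= x_i^{\top} W_Q R_i^{\top} R_j U_K c_j
               = x_i^{\top} W_Q \, R_{j-i} \, U_K c_j .
\end{aligned}
```

Because R_{j-i} changes with every query/key position pair, the product of W_Q and U_K cannot be folded into a single matrix ahead of time, which is consistent with the excerpt's statement that offline fusion is prevented on the key side.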
Figure 3. A projection matrix is decomposed as W = UΣV^T. Applying a threshold α on the diagonal of Σ suppresses singular values with σ_i < α to zero and retains only the corresponding columns of U and rows of V^T, yielding an effective rank r_eff. … reaching 26.33% for the static and 17.14% for the heuristic-based approach at 75% compression. This motivates an adaptive rank-selection strategy that supports aggressive …
Figure 5. Comparison of JD and HD. Extracted panel text: (a) Singular Values for Key Projection Matrix; (b) Singular Values for Value Projection Matrix. Axis labels: Block Index; Normalized Singular Value Magnitude.
Figure 6. The singular-value spectra of the key and value projection matrices in LLaMA-2-7B, normalized by their spectral norms (∥·∥_2), across 32 decoder blocks. … the compressed student model to match the teacher’s predictive distribution, preserving model behavior under aggressive compression. The final training objective is therefore: L_tot = L_KD + γ · L_acmp (5), where γ scales the contributions of the compressi…
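The objective quoted in the excerpt above combines a distillation term with a compression term weighted by γ. The sketch below shows one way such a sum could be assembled. The KL form of the distillation term and the squared-gap compression penalty are hypothetical stand-ins, since the page does not define either term; all function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution (assumed form of L_KD)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def compression_penalty(achieved, target):
    """Hypothetical stand-in for L_acmp: squared gap to the target compression rate."""
    return (achieved - target) ** 2

def total_loss(teacher_logits, student_logits, achieved, target, gamma):
    """L_tot = L_KD + gamma * L_acmp, mirroring Eq. (5) in the excerpt."""
    penalty = compression_penalty(achieved, target)
    return kd_loss(teacher_logits, student_logits) + gamma * penalty

# Student matches the teacher and hits the target: both terms vanish.
same = total_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], achieved=0.75, target=0.75, gamma=1.0)
# Student diverges and under-compresses: both terms contribute.
off = total_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], achieved=0.5, target=0.75, gamma=1.0)
```

Raising `gamma` in this sketch pushes the optimizer toward the compression target at the expense of fidelity to the teacher, which is the trade-off the excerpt attributes to γ.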
Figure 7. Normalized accuracy and reconstruction costs of JD and HD with different combinations for keys and values. The average zero-shot accuracy is measured across six tasks. The latency is the profiling result of the matrix multiplication operations involved in reconstruction on an NVIDIA RTX 4090 GPU. Extracted panel text: Outlier channels; (a) Before Block-wise-Hadamard; (b) After Block-wise-Hadamard. Dimension labels: H_in, c_out, r_eff − c_out.
Figure 8. Visualization of low-rank key states for the first head of the third layer in LongChat before and after applying block-wise Hadamard, which redistributes channel magnitudes and enables mixed-precision quantization with minimal degradation. … this trend: applying JD to both W_K and W_V yields the best accuracy but incurs the highest reconstruction latency, whereas using HD for keys and JD for values achieves th…
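The caption above credits a block-wise Hadamard transform with redistributing channel magnitudes. The sketch below shows that effect on a toy vector using an orthonormal fast Walsh-Hadamard transform applied to consecutive blocks. The block size, the example values, and the function names are assumptions; the paper's actual block size and its rule for choosing which channels get higher precision are not stated on this page.

```python
import math

def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # keeps the vector norm unchanged
    return [v * scale for v in out]

def blockwise_hadamard(vec, block):
    """Apply the transform independently to consecutive blocks of `block` channels."""
    out = []
    for start in range(0, len(vec), block):
        out.extend(hadamard(vec[start:start + block]))
    return out

# Hypothetical channel vector: one large outlier in the first block.
x = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]
y = blockwise_hadamard(x, block=4)

peak_before = max(abs(v) for v in x)  # 8.0
peak_after = max(abs(v) for v in y)   # 4.0
```

The outlier's energy is spread evenly over its block, so the peak magnitude halves while the vector norm is preserved. A smaller dynamic range per block is what makes a low-bit quantization grid less lossy.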
Figure 9. Normalized speedup of the attention module over the PyTorch SDPA FP16 implementation. The batch size is 16 in all setups. Solid lines represent exact measurements, while dashed lines indicate that the PyTorch baseline is out of memory. For these data points, the latencies are measured for the non-baseline variants, and the speedups are compared against the estimated baseline’s latency.
Figure 10. Ranks learned for Key projection matrices for each block and head using our adaptive rank selection strategy at 75% compression. The maximum rank of each head is 128. Axis labels: Block Index; Effective Rank Learnt.
Figure 11. Ranks learned for Value projection matrices for each block using our adaptive rank selection strategy at 75% compression. The maximum rank of each layer is 4096.
Figure 12. The latency breakdown of the attention operator in LLaMA-2-7B on the RTX 4090 GPU. The batch size is 16 in all setups. We performed a runtime breakdown of STAR-KV at 75% compression with and without quantization on LLaMA-2-7B, comparing it with PyTorch SDPA at context lengths of 8K, 16K, and 32K with batch size 16, as shown in …
Figure 13. Normalized speedup of the attention module over the PyTorch SDPA FP16 implementation on the RTX 6000 Pro GPU. The batch size is 16 in all setups. … latent keys are cached in a pre-RoPE form because RoPE can only be applied after reconstructing the latent keys using the decomposed key projections. Therefore, naively applying RoPE to all cached keys at each step would be expensive. Moreover, when the KV cache is …
Original abstract

Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank-aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20x overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9x speedup for the attention module and 3.1x end-to-end generation throughput. Our code is publicly available at: https://github.com/PriyanshBhatnagar/STAR-KV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STAR-KV, an adaptive low-rank KV cache compression framework for LLMs featuring (1) a differentiable soft thresholding mechanism for per-head and per-block rank selection, (2) a hybrid decomposition strategy that applies different factorizations to key and value projections according to sensitivity, and (3) low-rank-aware mixed-precision quantization. It reports up to 75% KV cache compression (20× when combined with quantization), 6.9× attention-module speedup, and 3.1× end-to-end generation throughput across multiple LLMs and benchmarks, with publicly available code.

Significance. If the empirical claims hold and the adaptive mechanism generalizes without per-model retuning, the work would be a meaningful engineering advance for memory- and compute-efficient LLM inference. The public code and combination of adaptive rank control with quantization are concrete strengths that support reproducibility and practical impact.

major comments (2)
  1. [Method section (differentiable thresholding and hybrid decomposition)] The central claim that the differentiable thresholding mechanism enables optimal, stable rank selection at head/block levels without model-specific retuning or training instability is load-bearing for the reported 75% compression and 6.9× speedup. No analysis of gradient stability through the soft-threshold operator, convergence of the rank parameters, or cross-model transfer without retuning is provided.
  2. [Experimental evaluation] The experimental results section reports specific compression ratios, speedups, and accuracy numbers, yet provides insufficient detail on baselines, exact accuracy metrics, number of runs, statistical significance, and fair comparison of the custom Triton kernels, making it impossible to verify whether the claims are supported.
minor comments (2)
  1. [Abstract] The abstract states results are obtained 'across multiple LLMs and benchmarks' without naming them; adding the specific models and tasks would improve clarity.
  2. [Method section] Notation for the soft-threshold operator and the low-rank-aware quantizer could be made more explicit by including the forward/backward equations and the precise definition of the mixed-precision levels.
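For reference, the forward and backward equations requested in the second minor comment would look as follows for the textbook soft-thresholding operator. Whether the paper uses exactly this form is an assumption; the symbols σ_i and α follow the Figure 3 caption.

```latex
% Standard soft thresholding of a singular value \sigma_i with threshold \alpha.
S_{\alpha}(\sigma_i) = \max(\sigma_i - \alpha,\, 0), \qquad
\frac{\partial S_{\alpha}}{\partial \sigma_i} = \mathbb{1}[\sigma_i > \alpha], \qquad
\frac{\partial S_{\alpha}}{\partial \alpha} = -\,\mathbb{1}[\sigma_i > \alpha]
```

The operator is not differentiable at σ_i = α, where any value in [0, 1] is a valid subgradient with respect to σ_i. Components below the threshold receive zero gradient in this form.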

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Method section (differentiable thresholding and hybrid decomposition)] The central claim that the differentiable thresholding mechanism enables optimal, stable rank selection at head/block levels without model-specific retuning or training instability is load-bearing for the reported 75% compression and 6.9× speedup. No analysis of gradient stability through the soft-threshold operator, convergence of the rank parameters, or cross-model transfer without retuning is provided.

    Authors: We agree that explicit analysis of gradient flow and convergence would strengthen the presentation. The soft-thresholding operator is constructed to be sub-differentiable, with non-zero gradients passed only through active components (analogous to proximal operators in sparse optimization). In the revised manuscript we will insert a new subsection (3.4) containing: (i) the gradient derivation, (ii) training curves for the learned rank parameters across layers, and (iii) an expanded transfer experiment showing that the same hyper-parameters yield comparable compression on Llama-2-7B, Mistral-7B, and Qwen-7B without per-model retuning. These additions will be accompanied by the already-public code. revision: yes

  2. Referee: [Experimental evaluation] The experimental results section reports specific compression ratios, speedups, and accuracy numbers, yet provides insufficient detail on baselines, exact accuracy metrics, number of runs, statistical significance, and fair comparison of the custom Triton kernels, making it impossible to verify whether the claims are supported.

    Authors: We accept that the current experimental section lacks sufficient detail for independent verification. In the revision we will expand Section 4 with: a complete baseline table including citations and implementation sources; precise metric definitions (WikiText-2 perplexity, zero-shot accuracy on the listed tasks); results from five independent runs reported as mean ± standard deviation; paired t-test p-values against the strongest baseline; and a dedicated paragraph describing the Triton kernels, hardware (A100-80GB), and identical evaluation protocol used for all methods. These changes will make the reported speedups and accuracy numbers fully reproducible. revision: yes
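The reporting protocol promised in the second response (mean ± standard deviation over five runs, plus a paired t-test against the strongest baseline) can be sketched with the Python standard library. The accuracy values below are placeholders invented for illustration, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics

def mean_std(values):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Placeholder accuracies for five matched runs (NOT results from the paper).
method = [70.0, 71.0, 69.0, 72.0, 70.0]
baseline = [68.0, 70.0, 68.0, 69.0, 69.0]

m_mean, m_std = mean_std(method)
t_stat, dof = paired_t(method, baseline)  # t is 4.0 up to rounding, with 4 degrees of freedom
```

Turning the statistic into a p-value needs the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes the same test directly if SciPy is available.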

Circularity Check

0 steps flagged

No circularity; empirical engineering method with external validation

full rationale

The paper introduces STAR-KV as a practical compression framework relying on differentiable soft thresholding, hybrid K/V decomposition, and low-rank-aware quantization. All performance claims (75% compression, speedups) are presented as outcomes of empirical evaluation across multiple LLMs and benchmarks, with public code provided. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the central mechanisms are proposed ansatzes validated externally rather than derived from prior results by the same authors in a load-bearing way.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method relies on several tunable components for thresholding and quantization whose specific values are not specified in the abstract, and assumes the standard transformer architecture properties hold for the low-rank approximation to be effective.

free parameters (2)
  • soft thresholding parameters
    Parameters controlling the differentiable rank selection mechanism are likely fitted or tuned per model or task.
  • mixed precision levels
    Quantization bit widths are chosen based on data statistics and sensitivity.
axioms (1)
  • domain assumption: Low-rank approximation can capture redundancy in KV projections without significant information loss
    This is the foundational assumption for all low-rank KV compression methods.

pith-pipeline@v0.9.1-grok · 5760 in / 1326 out tokens · 27275 ms · 2026-06-27T18:32:33.689573+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 24 canonical work pages · 13 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    URL https://arxiv.org/abs/2305.13245. Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms,

  2. [2]

    Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J

    URL https://arxiv.org/abs/2404.00456. Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding,

  3. [3]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    URL https://arxiv.org/abs/2308.14508. Chang, C.-C., Lin, W.-C., Lin, C.-Y., Chen, C.-Y., Hu, Y.-F., Wang, P.-S., Huang, N.-C., Ceze, L., Abdelfattah, M. S., and Wu, K.-C. Palu: Kv-cache compression with low-rank projection. In The Thirteenth International Conference on Learning Representations,

  4. [4]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,

  5. [5]

    LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., and Yang, M. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753,

  6. [6]

    Dodge, M

    Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–...

  7. [7]

    GLM-5 Team

    URL https://zenodo.org/records/12608602. Gholami, A., Yao, Z., Kim, S., Hooper, C., Mahoney, M. W., and Keutzer, K. Ai and memory wall. IEEE Micro, 44(3):33–39,

  8. [8]

    The Llama 3 Herd of Models

    URL https://arxiv.org/abs/2407.21783. Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quantization,

  9. [9]

    Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B

    URL https://arxiv.org/abs/2401.18079. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. Ruler: What’s the real context size of your long-context language models?,

  10. [10]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    URL https://arxiv.org/abs/2404.06654. Ji, T., Guo, B., Wu, Y., Guo, Q., Shen, L., Chen, Z., Qiu, X., Zhang, Q., and Gui, T. Towards economical inference: Enabling deepseek’s multi-head latent attention in any transformer-based llms. arXiv preprint arXiv:2502.14837,

  11. [11]

    Mistral 7B

    URL https://arxiv.org/abs/2310.06825. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pp. 611–626,

  12. [12]

    How long can context length of open-source LLMs truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,

    Li, D., Shao, R., Xie, A., Sheng, Y., Zheng, L., Gonzalez, J., Stoica, I., Ma, X., and Zhang, H. How long can context length of open-source LLMs truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,

  13. [13]

    Li, J., Zhang, Y., Hassan, M

    URL https://openreview.net/forum?id=LywifFNXV5. Li, J., Zhang, Y., Hassan, M. Y., Chafekar, T., Cai, T., Ren, Z., Guo, P., Karimzadeh, F., Wang, C., and Gan, C. Commvq: Commutative vector quantization for kv cache compression. arXiv preprint arXiv:2506.18879,

  14. [14]

    Matryoshkakv: Adaptive kv compression via trainable orthogonal projection. arXiv preprint arXiv:2410.14731,

    Lin, B., Zeng, Z., Xiao, Z., Kou, S., Hou, T., Gao, X., Zhang, H., and Deng, Z. Matryoshkakv: Adaptive kv compression via trainable orthogonal projection. arXiv preprint arXiv:2410.14731,

  15. [15]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a. Liu, H., Yan, W., Zaharia, M., and Abbeel, P. World model on million-length video and language with blockwise ringattention. In The...

  16. [16]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    URL https://openreview.net/forum?id=HN8V0flwJF. Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024b. Lozhkov, A., Ben Allal, L., von Werra, L., and Wolf, T. Fineweb-edu: the finest collection of educational content,

  17. [17]

    Pointer Sentinel Mixture Models

    Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

  18. [18]

    Saxena, U., Saha, G., Choudhary, S., and Roy, K

    URL https://arxiv.org/abs/2510.24273. Saxena, U., Saha, G., Choudhary, S., and Roy, K. Eigen attention: Attention in low-rank space for kv cache compression,

  19. [19]

    Eigen attention: Attention in low-rank space for KV cache compression

    URL https://arxiv.org/abs/2408.05646. Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685,

  20. [20]

    doi: 10.1016/j.neucom

    ISSN 0925-2312. doi: 10.1016/j.neucom.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063. Su, Y., Zhou, Y., Qiu, Q., Li, J., Xia, Q., Li, P., Duan, X., Wang, Z., and Zhang, M. Accurate kv cache quantization with outlier tokens tracing. arXiv preprint arXiv:2505.10938,

  21. [21]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    URL https://arxiv.org/abs/2307.09288. Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396,

  22. [22]

    Efficient Streaming Language Models with Attention Sinks

    URL https://arxiv.org/abs/2309.17453. Yan, X., Li, Z., Zhang, T., Qin, H., Kong, L., Zhang, Y., and Yang, X. Recalkv: Low-rank kv cache compression via head reordering and offline calibration,

  23. [23]

    Recalkv: Low-rank kv cache compression via head reordering and offline calibration, 2025

    URL https://arxiv.org/abs/2505.24357. Zhang, R., Wang, K., Liu, L., Wang, S., Cheng, H., Zhang, C., and Shen, Y. Lorc: Low-rank compression for llms kv cache with a progressive compression strategy. arXiv preprint arXiv:2410.03111,

  24. [24]

    H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    URL https://arxiv.org/abs/2306.14048. A. Appendix. A.1. Ablation Studies. A.1.1. Static vs. Adaptive Rank Selection. We compare adaptive rank selection against static low-rank compression. Static rank selection assigns a uniform rank budget across all decoder blocks and at...

  25. [25]

    10 and Fig

    Fig. 10 and Fig. 11 present the rank maps learned by the key and value projections, respectively, using our adaptive rank selection strategy based on the soft-thresholding mechanism. A.1.2. Stability of Learned Rank Profiles. We further analyze whether the learned rank p...

  26. [26]

    Table 9 isolates the roles of structured rotation and mixed-precision quantization

    the proposed block-wise Hadamard with mixed precision. Table 9 isolates the roles of structured rotation and mixed-precision quantization. Applying a single global Hadamard with uniform 3-bit quantization improves robustness over naive quantization but still exhibits noticeable accuracy degradation, as it cannot selectively protect high-magnitude channels...

  27. [27]

    On LLaMA-2-13B, STAR-KV substantially outperforms Palu at 60% compression and remains competitive even at 75% compression

    On Mistral-7B-Instruct-v0.2, STAR-KV achieves a higher zero-shot average than ReCalKV at the same 60% KV cache compression rate. On LLaMA-2-13B, STAR-KV substantially outperforms Palu at 60% compression and remains competitive even at 75% compression. Model Method Comp (%) Wiki2 C4 LM Avg. OBQA PIQA ARC-e ARC-c Hella Wino Avg. Mistral-7B-Inst. Baseline 0 ...

  28. [28]

    At a context length of 128K, STAR-KV achieves a 2.9× speedup, and adding 4-bit KV quantization increases the speedup to 4.3×

    Here, we limit the context length and batch size to a value that fits inside the GPU memory with the PyTorch SDPA baseline. At a context length of 128K, STAR-KV achieves a 2.9× speedup, and adding 4-bit KV quantization increases the speedup to 4.3×. A.6. Implementation Optimizations. Operation Fusion. RoPE is often implemented as bandwidth-bound element-wise...