Pith · machine review for the scientific record

arXiv: 2605.02905 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.IT · math.IT

Recognition: no theorem link

eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:13 UTC · model grok-4.3

classification: 💻 cs.LG · cs.IT · math.IT
keywords: KV cache compression · spectral denoising · optimal shrinkage · quantization · transformer inference · low-rank approximation · random matrix model · attention heads

The pith

KV cache compression reaches near-lossless quality at 2.2 bits per entry by first extracting the shared low-rank structure and then quantizing the residual.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the key-value cache in each attention head naturally splits into a low-rank part shared across tokens and a full-rank residual unique to each token. This split follows a spiked random matrix model, so an optimal shrinkage step can automatically pull out the shared part. The remaining residual has delocalized coordinates and near-zero inner-product bias, letting a simple per-vector quantizer achieve near-optimal distortion without extra corrections for outliers. If this holds, models can keep high accuracy on long-context and retrieval tasks while using far less memory during inference.
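The shape of the pipeline described above can be sketched in a few lines. This is a toy stand-in, not the paper's method: truncated SVD replaces eOptShrink (which selects the rank and shrinks singular values via the BBP edge), and a uniform per-vector scalar quantizer replaces TurboQuant; all dimensions and the synthetic spiked data are illustrative.

```python
import numpy as np

def eoptshrinkq_sketch(S, rank, bits=2):
    """Toy two-stage compression of one d x n head block S.

    Stage 1: truncated SVD stands in for eOptShrink (the real method picks
    the rank automatically and shrinks singular values; not shown here).
    Stage 2: uniform per-column scalar quantization stands in for TurboQuant.
    """
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    S_hat = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]  # shared context
    R = S - S_hat                                           # per-token residual
    levels = 2 ** bits
    scale = np.abs(R).max(axis=0, keepdims=True) + 1e-12    # one scale per token
    q = np.round((R / scale + 1) / 2 * (levels - 1))        # codes in {0..levels-1}
    R_hat = (q / (levels - 1) * 2 - 1) * scale              # dequantized residual
    return S_hat + R_hat

# synthetic "spiked" block: rank-3 signal plus unit-variance noise
rng = np.random.default_rng(0)
d, n, r = 64, 256, 3
S = rng.standard_normal((d, r)) @ rng.standard_normal((r, n)) * 2 \
    + rng.standard_normal((d, n))
rel_err = np.linalg.norm(S - eoptshrinkq_sketch(S, rank=3)) / np.linalg.norm(S)
```

Because the low-rank part is stored exactly in this sketch, the only reconstruction error comes from quantizing the residual, which is the isotropy argument the paper makes in miniature.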

Core claim

The central claim is that modeling the KV cache with the spiked random matrix model enables a two-stage method called eOptShrinkQ: optimal singular value shrinkage first extracts the low-rank shared context with automatic rank selection via the BBP phase transition, after which the residual satisfies the thin-shell property and can be quantized by TurboQuant to near-optimal distortion with provably near-zero inner-product bias. This pipeline removes the need for outlier handling or bias correction, freeing bits for better reconstruction.

What carries the argument

eOptShrink, the optimal singular value shrinkage operator that uses the BBP phase transition of the spiked random matrix model to separate low-rank shared context from the full-rank per-token residual.
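The BBP-style detection that carries this step can be illustrated directly. A minimal sketch, under the assumption of i.i.d. noise with known standard deviation sigma: singular values of a d × n pure-noise matrix concentrate below the bulk edge sigma · (√d + √n), so counting singular values above that edge gives a rank estimate. The real eOptShrink also shrinks the retained singular values, which this sketch omits.

```python
import numpy as np

def bbp_rank(S, sigma=1.0):
    """Count singular values above the noise bulk edge (BBP-style detection).

    For a d x n matrix with i.i.d. noise of std sigma, the largest noise
    singular value concentrates near sigma * (sqrt(d) + sqrt(n)); singular
    values above that edge are treated as signal spikes.
    """
    d, n = S.shape
    edge = sigma * (np.sqrt(d) + np.sqrt(n))
    s = np.linalg.svd(S, compute_uv=False)
    return int(np.sum(s > edge))

# planted rank-4 signal with spikes well above the bulk edge (~24 here)
rng = np.random.default_rng(1)
d, n = 64, 256
U, _ = np.linalg.qr(rng.standard_normal((d, 4)))
V, _ = np.linalg.qr(rng.standard_normal((n, 4)))
S = U @ np.diag([60.0, 55.0, 50.0, 45.0]) @ V.T + rng.standard_normal((d, n))
est = bbp_rank(S)
```

Spikes near the edge are the hard case: just above it, the observed singular value is biased and the estimate becomes unstable, which is exactly the regime the referee asks the authors to validate on real KV caches.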

If this is right

  • Automatic rank selection occurs without manual tuning through the BBP phase transition.
  • The residual after shrinkage has near-zero inner-product bias, eliminating the need for dedicated correction terms.
  • Coordinate delocalization in the residual yields near-optimal quantization distortion with a per-vector scalar quantizer.
  • At approximately 2.2 bits per entry the method outperforms prior 3-bit quantization on the 16-task LongBench suite.
  • At 2.2 bits the compressed cache matches or exceeds uncompressed FP16 accuracy on multi-needle retrieval tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition may apply to KV caches in other attention-based architectures, not only decoder-only transformers.
  • Spectral denoising could serve as an implicit regularizer that improves robustness on retrieval-heavy workloads.
  • Lower memory per token would allow either longer contexts or larger batch sizes at fixed hardware limits.
  • Similar shrinkage-plus-quantization pipelines might extend to compressing intermediate activations in other layers.

Load-bearing premise

The key-value cache in transformer attention heads admits a natural split into a low-rank shared context component and a full-rank per-token residual that is well described by the spiked random matrix model.

What would settle it

Measure the singular-value spectrum of real KV caches from Llama-3.1 or Ministral. If the largest singular values do not show a clear BBP phase transition separating signal from noise, or if the post-shrinkage residual exhibits large inner-product bias on actual attention computations, the decomposition does not hold.
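Both checks can be run on any extracted head block. The sketch below uses a synthetic spiked matrix in place of a real KV cache and assumes unit-variance noise; the thresholds and the Gaussian query model are our assumptions, not the paper's protocol.

```python
import numpy as np

def diagnostics(S, rank, sigma=1.0, n_queries=1000, seed=0):
    """Two settling checks on one d x n block.

    1. gap: ratio of the rank-th singular value to the noise bulk edge;
       a clear BBP separation means gap well above 1.
    2. bias: mean inner product between random query vectors and the
       post-shrinkage residual; the paper claims this is near zero.
    """
    d, n = S.shape
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    edge = sigma * (np.sqrt(d) + np.sqrt(n))
    gap = s[rank - 1] / edge
    R = S - U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_queries, d))
    bias = float((q @ R).mean())
    return gap, bias

# stand-in for an extracted KV block: planted rank-4 spikes plus noise
rng = np.random.default_rng(2)
d, n = 64, 256
U, _ = np.linalg.qr(rng.standard_normal((d, 4)))
V, _ = np.linalg.qr(rng.standard_normal((n, 4)))
S = U @ np.diag([60.0, 55.0, 50.0, 45.0]) @ V.T + rng.standard_normal((d, n))
gap, bias = diagnostics(S, rank=4)
```

On real caches the interesting outcome is the negative one: a gap near 1 or a bias that does not shrink with dimension would undermine the load-bearing premise.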

Figures

Figures reproduced from arXiv: 2605.02905 by Pei-Chun Su.

Figure 1. Llama-3.1-8B key head L15H3 (128 × 128). Top: singular value spectrum (left) and distribution (right); outlier singular values clearly separate above the bulk edge √b·λ₊. Bottom: the original block S̃ exhibits vertical stripes from per-channel outliers, the anisotropy that KIVI's per-channel quantization is designed to handle. The rank-r̂₊ signal Ŝ captures these stripes; the residual R = S̃ − Ŝ is visua… view at source ↗
Figure 2. Llama-3.1-8B value head L31H0 (128 × 128). Same format as Figure 1. view at source ↗
Figure 3. Ministral-8B key head L5H3 (128 × 128). Same format as Figure 1. view at source ↗
Figure 4. Ministral-8B value head L18H1 (128 × 128). Same format as Figure 1. view at source ↗
Figure 5. Multi-needle retrieval heatmaps for Llama-3.1-8B. view at source ↗
Figure 6. Multi-needle retrieval heatmaps for Ministral-8B. view at source ↗
Original abstract

We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank \emph{shared context} component and a full-rank \emph{per-token} residual, well described by the spiked random matrix model. This observation leads to eOptShrinkQ, a two-stage compression pipeline: optimal singular value shrinkage (eOptShrink) automatically extracts the shared structure, and the residual -- which satisfies the \emph{thin shell property} with delocalized coordinates -- is quantized by TurboQuant~\citep{zandieh2025turboquant}, a recently proposed per-vector scalar quantizer with near-optimal distortion guarantees. By restoring the isotropy that scalar quantization assumes, spectral denoising eliminates the need for both outlier handling and dedicated inner product bias correction, freeing those bits for improved reconstruction. The theoretical grounding in random matrix theory provides three guarantees: automatic rank selection via the BBP phase transition, provably near-zero inner product bias on the residual, and coordinate delocalization ensuring near-optimal quantization distortion. Experimentally, we validate eOptShrinkQ on Llama-3.1-8B and Ministral-8B across three levels: per-head MSE and inner product fidelity, where eOptShrinkQ saves nearly one bit per entry over TurboQuant at equivalent quality; end-to-end on LongBench (16 tasks), where eOptShrinkQ at $\sim$2.2 bits per entry outperforms TurboQuant at 3.0 bits; and multi-needle retrieval, where eOptShrinkQ at 2.2 bits closely matches or exceeds uncompressed FP16, suggesting that spectral denoising can act as a beneficial regularizer for retrieval-intensive tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that KV caches in transformer attention heads decompose into a low-rank shared-context component and a full-rank per-token residual well-described by the spiked random matrix model. This leads to eOptShrinkQ: optimal singular-value shrinkage (eOptShrink) for automatic rank selection via the BBP transition, followed by TurboQuant quantization on the residual (which satisfies the thin-shell property). The method is asserted to deliver three RMT-based guarantees (automatic rank selection, near-zero inner-product bias, near-optimal quantization distortion) and to outperform TurboQuant at lower bit rates (~2.2 bpe vs 3.0 bpe) on LongBench while matching or exceeding FP16 on multi-needle retrieval.

Significance. If the spiked-model assumption and associated guarantees hold for real KV caches, the work would supply a principled, parameter-free compression pipeline that restores isotropy for scalar quantization and yields measurable bit savings with end-to-end task gains. The combination of spectral denoising and TurboQuant is a concrete advance over pure quantization baselines, and the reported LongBench and retrieval results suggest practical utility for long-context inference.

major comments (3)
  1. [Abstract / theoretical grounding] Abstract and theoretical section: the central claim that eOptShrink provides 'automatic rank selection via the BBP phase transition' and 'provably near-zero inner product bias' rests on the KV matrix exactly obeying the spiked covariance model with delocalized residual coordinates; however, the manuscript provides neither the explicit derivation of the shrinkage operator from the spiked model nor any empirical singular-value histograms or BBP-edge statistics on the actual Llama-3.1-8B / Ministral-8B KV caches used in the experiments. Without this validation the guarantees become heuristic rather than automatic.
  2. [Experiments / LongBench results] Experimental validation (per-head MSE / inner-product fidelity and LongBench sections): the reported ~1-bit saving over TurboQuant at equal quality and the outperformance at 2.2 bpe are attributed to the spectral step restoring isotropy; yet no ablation isolating eOptShrink from TurboQuant, no control for post-hoc rank choices, and no direct test of coordinate delocalization (e.g., participation ratio or thin-shell diagnostics) are shown. This makes it impossible to confirm that the bit-saving advantage is due to the claimed RMT properties rather than other implementation details.
  3. [Multi-needle retrieval experiments] Multi-needle retrieval results: the claim that eOptShrinkQ at 2.2 bits 'closely matches or exceeds uncompressed FP16' is load-bearing for the 'near-lossless' and 'beneficial regularizer' assertions, but the manuscript does not report variance across seeds, context lengths, or needle positions, nor does it compare against a simple low-rank baseline without quantization. These controls are necessary to substantiate that the spectral denoising step is responsible for the observed retrieval behavior.
minor comments (3)
  1. [Theoretical section] Notation for the shrinkage operator and the precise definition of the 'thin shell property' should be stated explicitly with reference to the cited RMT results rather than left implicit.
  2. [Figures / Tables] Figure captions and table headers for the LongBench and per-head fidelity plots should include the exact bit-rate calculation (including any overhead from rank metadata) and the number of heads / layers averaged.
  3. [Related work / Method] The manuscript cites TurboQuant but does not restate its distortion guarantee or the coordinate-delocalization condition required for optimality; a short self-contained paragraph would improve readability.
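The bit-rate accounting requested in minor comment 2 can be made concrete. This is a hypothetical accounting of our own, not the paper's: it charges the quantized residual at its nominal rate and the low-rank factors U (d × rank) and V (n × rank) at a fixed per-value precision; the paper may amortize metadata differently (e.g., across layers or sequence length).

```python
def effective_bits_per_entry(d, n, rank, residual_bits, factor_bits=16):
    """Hypothetical effective bit rate for one d x n head block.

    Total storage = quantized residual (residual_bits per entry) plus the
    low-rank factors stored at factor_bits per value; the result is the
    total divided by the d * n entries it represents.
    """
    residual = residual_bits * d * n
    metadata = factor_bits * rank * (d + n)
    return (residual + metadata) / (d * n)

# illustrative numbers only: head dim 128, 4096 cached tokens, rank 8
bpe = effective_bits_per_entry(d=128, n=4096, rank=8, residual_bits=2)
```

Even at these illustrative settings the factor metadata adds about a bit per entry, which is why the referee asks that reported rates state the overhead explicitly.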

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We agree that additional theoretical derivations, ablations, and statistical controls will strengthen the manuscript and will incorporate them in the revision.

Point-by-point responses
  1. Referee: [Abstract / theoretical grounding] Abstract and theoretical section: the central claim that eOptShrink provides 'automatic rank selection via the BBP phase transition' and 'provably near-zero inner product bias' rests on the KV matrix exactly obeying the spiked covariance model with delocalized residual coordinates; however, the manuscript provides neither the explicit derivation of the shrinkage operator from the spiked model nor any empirical singular-value histograms or BBP-edge statistics on the actual Llama-3.1-8B / Ministral-8B KV caches used in the experiments. Without this validation the guarantees become heuristic rather than automatic.

    Authors: We appreciate this point. In the revised manuscript we will add an explicit derivation of the eOptShrink operator from the spiked random matrix model, including the steps that link the BBP phase transition to automatic rank selection and the near-zero inner-product bias on the residual. We will also include new empirical figures with singular-value histograms and BBP-edge statistics computed directly on the KV caches extracted from the Llama-3.1-8B and Ministral-8B models used in the experiments. These additions will make the claimed guarantees both theoretically derived and empirically validated on the target data. revision: yes

  2. Referee: [Experiments / LongBench results] Experimental validation (per-head MSE / inner-product fidelity and LongBench sections): the reported ~1-bit saving over TurboQuant at equal quality and the outperformance at 2.2 bpe are attributed to the spectral step restoring isotropy; yet no ablation isolating eOptShrink from TurboQuant, no control for post-hoc rank choices, and no direct test of coordinate delocalization (e.g., participation ratio or thin-shell diagnostics) are shown. This makes it impossible to confirm that the bit-saving advantage is due to the claimed RMT properties rather than other implementation details.

    Authors: We agree that isolating the spectral step's contribution is essential. We will add a full ablation study comparing (i) TurboQuant alone, (ii) eOptShrink with full-precision residuals, and (iii) the combined eOptShrinkQ pipeline. We will also include controls that replace the BBP-based rank selection with post-hoc alternatives (fixed-rank truncation and heuristic thresholding) to demonstrate the advantage of the automatic RMT-driven choice. In addition, we will report participation ratios and thin-shell diagnostics on the per-token residuals to directly confirm coordinate delocalization and the thin-shell property. revision: yes

  3. Referee: [Multi-needle retrieval experiments] Multi-needle retrieval results: the claim that eOptShrinkQ at 2.2 bits 'closely matches or exceeds uncompressed FP16' is load-bearing for the 'near-lossless' and 'beneficial regularizer' assertions, but the manuscript does not report variance across seeds, context lengths, or needle positions, nor does it compare against a simple low-rank baseline without quantization. These controls are necessary to substantiate that the spectral denoising step is responsible for the observed retrieval behavior.

    Authors: We acknowledge the need for these controls. In the revision we will report means and standard deviations for the multi-needle retrieval task across multiple random seeds, a range of context lengths, and varied needle positions. We will also add a direct comparison against a simple low-rank baseline (truncated SVD without quantization) to isolate the contribution of the spectral denoising step. These additions will provide clearer evidence that the observed retrieval performance stems from the RMT-guided denoising acting as a beneficial regularizer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external RMT and cited quantizer

Full rationale

The paper states the KV cache decomposition into low-rank shared context plus residual as an observation validated by the spiked random matrix model, then applies eOptShrink for automatic rank selection via the BBP transition and TurboQuant for the residual. These elements are drawn from external random matrix theory results and the independently cited TurboQuant work (zandieh2025turboquant), with no fitted parameters renamed as predictions, no self-definitional loops in the equations, and no load-bearing self-citations. The central performance claims on LongBench and retrieval are evaluated against external baselines rather than reducing to the inputs by construction. The approach is self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that KV cache matrices follow the spiked random matrix model, which supplies automatic rank selection and the thin-shell/delocalization properties used to justify quantization performance.

axioms (1)
  • domain assumption The key-value cache admits a natural decomposition into low-rank shared context and full-rank per-token residual described by the spiked random matrix model.
    This modeling choice is presented as the foundational observation enabling the entire pipeline.

pith-pipeline@v0.9.0 · 5613 in / 1437 out tokens · 89538 ms · 2026-05-10T20:13:39.196061+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 16 canonical work pages · 5 internal anchors

  1. Sankaraleengam Alagapan, Hae Won Shin, Flavio Fröhlich, and Hau-Tieng Wu. Diffusion geometry approach to efficiently remove electrical stimulation artifacts in intracranial electroencephalography. Journal of Neural Engineering, 16(3):036010, 2019.
  2. Rasoul Anvari, Mohammad Amir Nazari Siahsar, Saman Gholtashi, Aria Reza Kahoo, and Mohsen Mohammadi. Seismic random noise attenuation using synchrosqueezed wavelet transform and low-rank signal matrix approximation. IEEE Transactions on Geoscience and Remote Sensing, 55(11):6574–6581, 2018.
  3. Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
  4. Jinho Baik, Gérard Ben Arous, Sandrine Péché, et al. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697, 2005.
  5. Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012.
  6. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024.
  7. Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S Abdelfattah. xKV: Cross-layer SVD for KV-cache compression. arXiv preprint arXiv:2503.18893, 2025.
  8. Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S Abdelfattah, and Kai-Chiang Wu. Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024.
  9. Xiucai Ding and Fan Yang. Spiked separable covariance matrices and principal components. Annals of Statistics, 49(2):1113–1138, 2021.
  10. David L Donoho and Matan Gavish. The optimal hard threshold for singular values is 4/√3. arXiv preprint arXiv:1305.5870, 2013.
  11. David L Donoho, Matan Gavish, and Iain M Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. Annals of Statistics, 46(4):1742, 2018.
  12. Jianyang Gao and Cheng Long. RaBitQ: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search. Proceedings of the ACM on Management of Data, 2(3):1–27, 2024.
  13. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  14. Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, and Amir Zandieh. PolarQuant: Quantizing KV caches with polar transformation. arXiv preprint arXiv:2502.02617, 2025.
  15. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024.
  16. Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. arXiv preprint arXiv:2207.00112, 2022.
  17. Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM. arXiv preprint arXiv:2403.05527, 2024.
  18. William Leeb. Optimal singular value shrinkage for operator norm loss: Extending to non-square matrices. Statistics & Probability Letters, 186:109472, 2022.
  19. William Leeb and Elad Romanov. Optimal spectral shrinkage and PCA with heteroscedastic noise. IEEE Transactions on Information Theory, 67(5):3009–3037, 2021.
  20. William E Leeb. Matrix denoising for weighted loss functions and heterogeneous signals. SIAM Journal on Mathematics of Data Science, 3(3):987–1012, 2021.
  21. Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024.
  22. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
  23. Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026.
  24. Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.
  25. Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  26. Vladimir A. Marchenko and Leonid A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
  27. Joel Max. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6(1):7–12, 1960.
  28. Raj Rao Nadakuditi. OptShrink: An algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage. IEEE Transactions on Information Theory, 60(5):3002–3018, 2014.
  29. Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo M Ponti. Dynamic memory compression: Retrofitting LLMs for accelerated inference. arXiv preprint arXiv:2403.09636, 2024.
  30. Thomas Schuster, Marian Lambert, Nico Döring, and Julius Trögele. Needle-in-the-haystack testing LLMs with a complex reasoning task. In International Conference on Engineering Applications of Neural Networks, pages 254–266. Springer, 2025.
  31. Jay S Stanley III, Junchen Yang, Ruiqi Li, Ofir Lindenbaum, Dmitry Kobak, Boris Landa, and Yuval Kluger. Principled PCA separates signal from noise in omics count data. bioRxiv, 2025.
  32. Pei-Chun Su, Stephen Miller, Salim Idriss, Piers Barker, and Hau-Tieng Wu. Recovery of the fetal electrocardiogram for morphological analysis from two trans-abdominal channels via optimal shrinkage. Physiological Measurement, 40(11):115005, 2019.
  33. Pei-Chun Su and Hau-Tieng Wu. Data-driven optimal shrinkage of singular values under high-dimensional noise with separable covariance structure with application. Applied and Computational Harmonic Analysis, 74:101698, 2025.
  34. Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. Proceedings of Machine Learning Research, 235:48630, 2024.
  35. Jelle Veraart, Dmitry S Novikov, Daan Christiaens, Benjamin Ades-Aron, Jan Sijbers, and Els Fieremans. Denoising of diffusion MRI using random matrix theory. NeuroImage, 142:394–406, 2016.
  36. Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  37. Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, and Yu Wang. CSKV: Training-efficient channel shrinking for KV cache in long-context scenarios. arXiv preprint arXiv:2409.10593, 2024.
  38. Fan Yang. Edge universality of separable covariance matrices. Electronic Journal of Probability, 24:1–57, 2019.
  39. Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, and Yuan Mingxuan. SVDq: 1.25-bit and 410x key cache compression for LLM attention. arXiv preprint arXiv:2502.15304, 2025.
  40. Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025.
  41. Yuxuan Zhu, David H Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, and Pin-Yu Chen. OjaKV: Context-aware online low-rank KV cache compression with Oja's rule. arXiv preprint arXiv:2509.21623, 2025.