pith. sign in

arxiv: 2606.03458 · v1 · pith:H5ARCNTUnew · submitted 2026-06-02 · 💻 cs.LG

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Pith reviewed 2026-06-28 11:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords KV-cache quantizationerror accumulationvariance normalizationautoregressive decodingHadamard rotation2-bit precisionreasoning benchmarks
0
0 comments X

The pith

Variance normalization after Hadamard rotation on KV caches prevents error accumulation during autoregressive reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that quantization errors in KV caches build up over successive decoding steps in autoregressive generation, driven mainly by wrong scales for individual tokens. KVarN counters this through a fixed sequence of Hadamard rotation followed by variance normalization along both axes of the K and V matrices, without any calibration data or model tuning. This procedure corrects outlying token scales and cuts error growth relative to prior quantizers. The result is stronger performance on generative reasoning benchmarks at 2-bit precision. A reader would care because the method targets a core memory constraint that limits how far test-time scaling can be pushed for improved model reasoning.

Core claim

In autoregressive decoding, KV-cache quantization errors accumulate across timesteps primarily due to incorrect per-token scales. KVarN applies a Hadamard rotation followed by dual-scaling variance normalization across both axes of the K and V matrices to fix outlying token-scale errors. This combination reduces error accumulation over existing baselines and establishes new state-of-the-art results on benchmarks including MATH500, AIME24, and HumanEval at 2-bit precision.

What carries the argument

Dual-scaling variance normalization applied after Hadamard rotation, which normalizes K and V matrices along both sequence and feature dimensions to correct token scales.

If this is right

  • Longer reasoning sequences become feasible at 2-bit KV precision without proportional accuracy loss.
  • KV-cache quantization no longer requires per-model calibration data.
  • Reasoning benchmarks such as MATH500 and HumanEval see improved results at low bit widths.
  • Memory bottlenecks during extended autoregressive decoding are alleviated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same normalization pattern could be tested on other cached states inside transformers to check for similar drift reduction.
  • Combining KVarN with complementary quantization methods might push viable precision below 2 bits on the same tasks.
  • The calibration-free property suggests easier deployment across varied model families without repeated tuning runs.
  • If the scale-correction mechanism holds, it points to a broader design principle for stabilizing sequential generation under quantization.

Load-bearing premise

That incorrect per-token scales are the primary driver of error accumulation and that a fixed variance normalization procedure can correct them reliably without calibration or tuning.

What would settle it

Measuring error accumulation rates on MATH500 or AIME24 after applying the Hadamard-plus-dual-variance-normalization procedure and finding no reduction relative to baselines would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.03458 by Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lorenz K. Muller, Lukas Cavigelli, Philippe Bich.

Figure 1
Figure 1. Figure 1: (a) What fraction of the top k% largest errors is due to magnitude rather than direction [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Schematic layout of KVarN. Every token is Hadamard-rotated in the channel dimension. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Replacing the 5% worst out￾lier errors with high precision values im￾proves end-to-end KL-divergence more than fixing the other 95%, even though more MSE lies there (see [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prior KV-Cache quantization papers address the parallel prefill scenario (red dashed arrows). [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average reconstruction error after quantization across all attention layer outputs in [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Speed measurement on GPU in the fast vLLM framework. The variance-normalization [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Arrangement of Hadamard transforms in an attention layer in our method. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Joint distribution of K magnitude and quantized K magnitude using different quantization methods. KVarN tightly controls token scales, while the baseline KIVI and ablations of our method have substantial off-diagonals. A Hadamard Transform Arrangement The arragement of Hadamard transforms we use is similar to QuaRot [2], see [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Plot to complement Fig. 3. It shows much MSE remains, if we replace the largest top k% [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Needle-in-a-Haystack on Qwen3-4B: KIVI vs. KVarN under Static and Accumulated [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Median dequantization time per call for KIVI (single scale) vs. KVarN (dual scale, [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
read the original abstract

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state-of-theart for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that KV-cache quantization errors in autoregressive decoding accumulate primarily due to incorrect per-token scales, unlike in prefill settings. It introduces KVarN, a calibration-free quantizer that applies a Hadamard rotation followed by dual-scaling variance normalization across both axes of the K and V matrices. This is reported to fix outlying token-scale errors, reduce accumulation relative to baselines, and achieve new state-of-the-art results on generative reasoning benchmarks (MATH500, AIME24, HumanEval) at 2-bit precision, with a vLLM implementation provided.

Significance. If the central attribution to token-scale errors and the calibration-free generalization hold, the work would be significant for memory-efficient long-context inference in LLMs, particularly enabling test-time scaling on reasoning tasks. The calibration-free nature and public implementation are concrete strengths that would facilitate adoption if the error-mitigation mechanism is rigorously validated.

major comments (3)
  1. [Results] Results section: The manuscript reports benchmark improvements but provides no quantitative error-accumulation curves (e.g., per-timestep scale deviation, MSE on KV states, or attention-score drift over decoding steps) for KVarN versus baselines. This absence leaves the load-bearing claim—that accumulation is driven primarily by incorrect token scales and is fixed by the dual normalization—without direct empirical support.
  2. [Ablation] Ablation or method evaluation: There is no ablation isolating the dual-variance normalization (across both K/V axes) from the Hadamard rotation alone, nor any test of whether the fixed normalization generalizes without per-model statistics. This is required to substantiate that the procedure corrects scales without calibration or model-specific tuning on the reported 2-bit SOTA results.
  3. [Experiments] Experimental setup: Details on baseline re-implementations (e.g., whether prior KV quantizers were adapted to the autoregressive regime with identical 2-bit settings and no hidden calibration) are insufficient to rule out implementation differences as the source of the reported gains on MATH500/AIME24/HumanEval.
minor comments (2)
  1. [Method] Notation for the dual-scaling factors (e.g., how variance is computed per-axis and applied) could be clarified with an explicit equation in the method section to aid reproducibility.
  2. [Figures] Figure captions for any error or benchmark plots should explicitly state the number of decoding steps and models evaluated to allow direct comparison with the accumulation claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the empirical support and clarity of the work.

read point-by-point responses
  1. Referee: [Results] Results section: The manuscript reports benchmark improvements but provides no quantitative error-accumulation curves (e.g., per-timestep scale deviation, MSE on KV states, or attention-score drift over decoding steps) for KVarN versus baselines. This absence leaves the load-bearing claim—that accumulation is driven primarily by incorrect token scales and is fixed by the dual normalization—without direct empirical support.

    Authors: We agree that explicit quantitative curves would provide stronger direct support for the central claim on token-scale errors. In the revised manuscript we will add per-timestep analyses, including scale deviation, MSE on KV states, and attention-score drift, comparing KVarN against baselines. revision: yes

  2. Referee: [Ablation] Ablation or method evaluation: There is no ablation isolating the dual-variance normalization (across both K/V axes) from the Hadamard rotation alone, nor any test of whether the fixed normalization generalizes without per-model statistics. This is required to substantiate that the procedure corrects scales without calibration or model-specific tuning on the reported 2-bit SOTA results.

    Authors: We acknowledge that an ablation isolating the dual-variance normalization from the Hadamard rotation, together with tests of generalization without per-model statistics, would strengthen the claims. We will add such an ablation study and cross-model generalization experiments in the revision. revision: yes

  3. Referee: [Experiments] Experimental setup: Details on baseline re-implementations (e.g., whether prior KV quantizers were adapted to the autoregressive regime with identical 2-bit settings and no hidden calibration) are insufficient to rule out implementation differences as the source of the reported gains on MATH500/AIME24/HumanEval.

    Authors: We will expand the experimental setup section to include detailed descriptions of baseline re-implementations, confirming adaptation to the autoregressive regime, identical 2-bit settings, and absence of hidden calibration steps beyond those specified in the original works. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is empirical construction validated on external benchmarks

full rationale

The paper presents KVarN as a calibration-free engineering construction (Hadamard rotation followed by dual-axis variance normalization) whose performance is measured directly on external generative benchmarks including MATH500, AIME24 and HumanEval. The observation that errors accumulate due to incorrect token scales is stated as an empirical finding from analysis of the autoregressive regime, not derived from any self-referential equation or fitted parameter defined inside the paper. No load-bearing steps reduce by construction to inputs, no self-citations are invoked for uniqueness or ansatzes, and the central claim does not rename known results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard linear-algebra operations (Hadamard rotation, per-axis variance normalization) whose correctness is taken from prior numerical linear algebra; no new free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (2)
  • standard math Hadamard rotation preserves the numerical properties needed for subsequent per-axis variance normalization in floating-point arithmetic.
    Invoked implicitly when the paper states the rotation is applied before normalization.
  • domain assumption Variance normalization across token and feature axes can be computed stably from the quantized matrix without additional calibration data.
    Required for the claim that the method is calibration-free.

pith-pipeline@v0.9.1-grok · 5730 in / 1388 out tokens · 21452 ms · 2026-06-28T11:34:35.910708+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

  3. [3]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

  4. [4]

    Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning

    Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofPr...

  5. [5]

    PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025

  6. [6]

    Quip: 2-bit quanti- zation of large language models with guarantees.Advances in neural information processing systems, 36:4396–4429, 2023

    Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quanti- zation of large language models with guarantees.Advances in neural information processing systems, 36:4396–4429, 2023

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  9. [9]

    KVQuant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

  10. [10]

    Needle in a haystack-pressure testing llms

    Greg Kamradt. Needle in a haystack-pressure testing llms. https://github.com/gkamradt/ LLMTest_NeedleInAHaystack, 2023

  11. [11]

    Lee, Sangdoo Yun, and Hyun Oh Song

    Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  12. [12]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 10

  13. [13]

    How long can context length of open-source LLMs truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source LLMs truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

  14. [14]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 2294...

  15. [15]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

  16. [16]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  17. [17]

    Spinquant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

  18. [18]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Process...

  19. [19]

    Kivi: A tuning-free asymmetric 2bit quantization for kv cache.Proceedings of Machine Learning Research, 235:32332–32344, 2024

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache.Proceedings of Machine Learning Research, 235:32332–32344, 2024

  20. [20]

    Pushing the limits of large language model quantization via the linearity theorem

    Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Pushing the limits of large language model quantization via the linearity theorem. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  21. [21]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20275–20321, 2025

  22. [22]

    Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights.arXiv preprint arXiv:2509.22944, 2025

    Lorenz K Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, and Lukas Cavigelli. Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights.arXiv preprint arXiv:2509.22944, 2025

  23. [23]

    Learning to reason with llms, September 2024

    OpenAI. Learning to reason with llms, September 2024. Accessed: May 2026

  24. [24]

    Thinking vs

    Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar. Thinking vs. doing: Im- proving agent reasoning by scaling test-time interaction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  25. [25]

    Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  26. [26]

    Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State- Space Architectures from S4 to Mamba

    Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, and Subasish Das. From s4 to mamba: A comprehensive survey on structured state space models.arXiv preprint arXiv:2503.18970, 2025. 11

  27. [27]

    MultiQA: An empirical investigation of generalization and transfer in reading comprehension

    Alon Talmor and Jonathan Berant. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4911–4921, Florence, Italy, July 2019. Association for Computational Linguistics

  28. [28]

    Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

    Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

  29. [29]

    D2O: Dynamic discriminative operations for efficient long-context inference of large language models

    Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and Mi Zhang. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. InThe Thirteenth International Conference on Learning Representations, 2025

  30. [30]

    Polarquant: Leveraging polar transformation for key cache quantization and decoding acceleration

    Songhao Wu, Ang Lv, xiao feng, Yufei zhang, Xun Zhang, Guojun Yin, Wei Lin, and Rui Yan. Polarquant: Leveraging polar transformation for key cache quantization and decoding acceleration. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  31. [31]

    Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643, 2025

    Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, et al. Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643, 2025

  32. [32]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  33. [33]

    Turboquant: Online vector quantization with near-optimal distortion rate

    Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. InThe Fourteenth International Conference on Learning Representations, 2026

  34. [34]

    American invitational mathematics examination (aime) 2024, 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

  35. [35]

    Cam: Cache merging for memory-efficient LLMs inference

    Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. Cam: Cache merging for memory-efficient LLMs inference. InForty-first International Conference on Machine Learning, 2024

  36. [36]

    H 2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "Atlas" Wang, and Beidi Chen. H 2O: Heavy-hitter oracle for efficient generative inference of large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in N...

  37. [37]

    The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. 12 Wq RoPE(·) H Wk RoPE(·) H Wv H Quantize(·) Quantize(·) Cache Cache Dequantize(·) Dequantize(·) × Softmax(·) × H−1 Wox Figure 7: Arrangement of Hadamard transforms in an attention la...