KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Chiara Boretti; Hyun-Min Chang; Jiawei Zhuang; Lorenz K. Muller; Lukas Cavigelli; Philippe Bich

arxiv: 2606.03458 · v1 · pith:H5ARCNTUnew · submitted 2026-06-02 · 💻 cs.LG

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Lorenz K. Muller , Philippe Bich , Chiara Boretti , Hyun-Min Chang , Jiawei Zhuang , Lukas Cavigelli This is my paper

Pith reviewed 2026-06-28 11:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords KV-cache quantizationerror accumulationvariance normalizationautoregressive decodingHadamard rotation2-bit precisionreasoning benchmarks

0 comments

The pith

Variance normalization after Hadamard rotation on KV caches prevents error accumulation during autoregressive reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that quantization errors in KV caches build up over successive decoding steps in autoregressive generation, driven mainly by wrong scales for individual tokens. KVarN counters this through a fixed sequence of Hadamard rotation followed by variance normalization along both axes of the K and V matrices, without any calibration data or model tuning. This procedure corrects outlying token scales and cuts error growth relative to prior quantizers. The result is stronger performance on generative reasoning benchmarks at 2-bit precision. A reader would care because the method targets a core memory constraint that limits how far test-time scaling can be pushed for improved model reasoning.

Core claim

In autoregressive decoding, KV-cache quantization errors accumulate across timesteps primarily due to incorrect per-token scales. KVarN applies a Hadamard rotation followed by dual-scaling variance normalization across both axes of the K and V matrices to fix outlying token-scale errors. This combination reduces error accumulation over existing baselines and establishes new state-of-the-art results on benchmarks including MATH500, AIME24, and HumanEval at 2-bit precision.

What carries the argument

Dual-scaling variance normalization applied after Hadamard rotation, which normalizes K and V matrices along both sequence and feature dimensions to correct token scales.

If this is right

Longer reasoning sequences become feasible at 2-bit KV precision without proportional accuracy loss.
KV-cache quantization no longer requires per-model calibration data.
Reasoning benchmarks such as MATH500 and HumanEval see improved results at low bit widths.
Memory bottlenecks during extended autoregressive decoding are alleviated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same normalization pattern could be tested on other cached states inside transformers to check for similar drift reduction.
Combining KVarN with complementary quantization methods might push viable precision below 2 bits on the same tasks.
The calibration-free property suggests easier deployment across varied model families without repeated tuning runs.
If the scale-correction mechanism holds, it points to a broader design principle for stabilizing sequential generation under quantization.

Load-bearing premise

That incorrect per-token scales are the primary driver of error accumulation and that a fixed variance normalization procedure can correct them reliably without calibration or tuning.

What would settle it

Measuring error accumulation rates on MATH500 or AIME24 after applying the Hadamard-plus-dual-variance-normalization procedure and finding no reduction relative to baselines would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.03458 by Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lorenz K. Muller, Lukas Cavigelli, Philippe Bich.

**Figure 2.** Figure 2: Schematic layout of KVarN. Every token is Hadamard-rotated in the channel dimension. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Replacing the 5% worst outlier errors with high precision values improves end-to-end KL-divergence more than fixing the other 95%, even though more MSE lies there (see [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Prior KV-Cache quantization papers address the parallel prefill scenario (red dashed arrows). [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Average reconstruction error after quantization across all attention layer outputs in [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Speed measurement on GPU in the fast vLLM framework. The variance-normalization [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Arrangement of Hadamard transforms in an attention layer in our method. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Joint distribution of K magnitude and quantized K magnitude using different quantization methods. KVarN tightly controls token scales, while the baseline KIVI and ablations of our method have substantial off-diagonals. A Hadamard Transform Arrangement The arragement of Hadamard transforms we use is similar to QuaRot [2], see [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Plot to complement Fig. 3. It shows much MSE remains, if we replace the largest top k% [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Needle-in-a-Haystack on Qwen3-4B: KIVI vs. KVarN under Static and Accumulated [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Median dequantization time per call for KIVI (single scale) vs. KVarN (dual scale, [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

read the original abstract

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state-of-theart for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KVarN pairs Hadamard rotation with dual-axis variance normalization to cut error growth in autoregressive KV caches and reports 2-bit gains on reasoning benchmarks, but the mechanism rests on limited supporting analysis.

read the letter

The paper introduces KVarN, a calibration-free quantizer that applies a Hadamard rotation followed by variance normalization along both axes of the K and V matrices. The core idea is that autoregressive decoding makes per-token scale errors the main driver of accumulation, and this fixed procedure corrects them.

What stands out is the shift in focus. Prior KV quantization work often targets prefill settings, while this one measures behavior across long decoding steps on generative tasks. The authors report better accuracy than baselines on MATH500, AIME24, and HumanEval at 2 bits, and they release a vLLM implementation, which makes the result immediately usable.

The soft spots are in the evidence for the claimed mechanism. The abstract states that incorrect token scales drive accumulation and that the dual scaling fixes outlying errors, yet it supplies no error-accumulation curves over tokens and no ablation that isolates the dual normalization from the rotation. Baseline re-implementations also lack enough detail to judge fairness. These gaps leave open whether the gains come from the stated scale correction or from other properties of the transform.

The work targets practitioners who need to stretch test-time compute on fixed hardware for reasoning models. A reader building or tuning long-horizon inference pipelines would find the method and code directly relevant.

It deserves a serious referee. The engineering contribution is concrete, the benchmarks are standard, and the calibration-free property is a practical plus. Reviewers can ask for the missing curves and ablations, but the paper is coherent enough to warrant that step.

Recommendation: send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper claims that KV-cache quantization errors in autoregressive decoding accumulate primarily due to incorrect per-token scales, unlike in prefill settings. It introduces KVarN, a calibration-free quantizer that applies a Hadamard rotation followed by dual-scaling variance normalization across both axes of the K and V matrices. This is reported to fix outlying token-scale errors, reduce accumulation relative to baselines, and achieve new state-of-the-art results on generative reasoning benchmarks (MATH500, AIME24, HumanEval) at 2-bit precision, with a vLLM implementation provided.

Significance. If the central attribution to token-scale errors and the calibration-free generalization hold, the work would be significant for memory-efficient long-context inference in LLMs, particularly enabling test-time scaling on reasoning tasks. The calibration-free nature and public implementation are concrete strengths that would facilitate adoption if the error-mitigation mechanism is rigorously validated.

major comments (3)

[Results] Results section: The manuscript reports benchmark improvements but provides no quantitative error-accumulation curves (e.g., per-timestep scale deviation, MSE on KV states, or attention-score drift over decoding steps) for KVarN versus baselines. This absence leaves the load-bearing claim—that accumulation is driven primarily by incorrect token scales and is fixed by the dual normalization—without direct empirical support.
[Ablation] Ablation or method evaluation: There is no ablation isolating the dual-variance normalization (across both K/V axes) from the Hadamard rotation alone, nor any test of whether the fixed normalization generalizes without per-model statistics. This is required to substantiate that the procedure corrects scales without calibration or model-specific tuning on the reported 2-bit SOTA results.
[Experiments] Experimental setup: Details on baseline re-implementations (e.g., whether prior KV quantizers were adapted to the autoregressive regime with identical 2-bit settings and no hidden calibration) are insufficient to rule out implementation differences as the source of the reported gains on MATH500/AIME24/HumanEval.

minor comments (2)

[Method] Notation for the dual-scaling factors (e.g., how variance is computed per-axis and applied) could be clarified with an explicit equation in the method section to aid reproducibility.
[Figures] Figure captions for any error or benchmark plots should explicitly state the number of decoding steps and models evaluated to allow direct comparison with the accumulation claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the empirical support and clarity of the work.

read point-by-point responses

Referee: [Results] Results section: The manuscript reports benchmark improvements but provides no quantitative error-accumulation curves (e.g., per-timestep scale deviation, MSE on KV states, or attention-score drift over decoding steps) for KVarN versus baselines. This absence leaves the load-bearing claim—that accumulation is driven primarily by incorrect token scales and is fixed by the dual normalization—without direct empirical support.

Authors: We agree that explicit quantitative curves would provide stronger direct support for the central claim on token-scale errors. In the revised manuscript we will add per-timestep analyses, including scale deviation, MSE on KV states, and attention-score drift, comparing KVarN against baselines. revision: yes
Referee: [Ablation] Ablation or method evaluation: There is no ablation isolating the dual-variance normalization (across both K/V axes) from the Hadamard rotation alone, nor any test of whether the fixed normalization generalizes without per-model statistics. This is required to substantiate that the procedure corrects scales without calibration or model-specific tuning on the reported 2-bit SOTA results.

Authors: We acknowledge that an ablation isolating the dual-variance normalization from the Hadamard rotation, together with tests of generalization without per-model statistics, would strengthen the claims. We will add such an ablation study and cross-model generalization experiments in the revision. revision: yes
Referee: [Experiments] Experimental setup: Details on baseline re-implementations (e.g., whether prior KV quantizers were adapted to the autoregressive regime with identical 2-bit settings and no hidden calibration) are insufficient to rule out implementation differences as the source of the reported gains on MATH500/AIME24/HumanEval.

Authors: We will expand the experimental setup section to include detailed descriptions of baseline re-implementations, confirming adaptation to the autoregressive regime, identical 2-bit settings, and absence of hidden calibration steps beyond those specified in the original works. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is empirical construction validated on external benchmarks

full rationale

The paper presents KVarN as a calibration-free engineering construction (Hadamard rotation followed by dual-axis variance normalization) whose performance is measured directly on external generative benchmarks including MATH500, AIME24 and HumanEval. The observation that errors accumulate due to incorrect token scales is stated as an empirical finding from analysis of the autoregressive regime, not derived from any self-referential equation or fitted parameter defined inside the paper. No load-bearing steps reduce by construction to inputs, no self-citations are invoked for uniqueness or ansatzes, and the central claim does not rename known results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard linear-algebra operations (Hadamard rotation, per-axis variance normalization) whose correctness is taken from prior numerical linear algebra; no new free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (2)

standard math Hadamard rotation preserves the numerical properties needed for subsequent per-axis variance normalization in floating-point arithmetic.
Invoked implicitly when the paper states the rotation is applied before normalization.
domain assumption Variance normalization across token and feature axes can be computed stably from the quantized matrix without additional calibration data.
Required for the claim that the method is calibration-free.

pith-pipeline@v0.9.1-grok · 5730 in / 1388 out tokens · 21452 ms · 2026-06-28T11:34:35.910708+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 6 canonical work pages · 4 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

2024
[3]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

2024
[4]

Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning

Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofPr...

2025
[5]

PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025

2025
[6]

Quip: 2-bit quanti- zation of large language models with guarantees.Advances in neural information processing systems, 36:4396–4429, 2023

Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quanti- zation of large language models with guarantees.Advances in neural information processing systems, 36:4396–4429, 2023

2023
[7]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

2021
[8]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

KVQuant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

2024
[10]

Needle in a haystack-pressure testing llms

Greg Kamradt. Needle in a haystack-pressure testing llms. https://github.com/gkamradt/ LLMTest_NeedleInAHaystack, 2023

2023
[11]

Lee, Sangdoo Yun, and Hyun Oh Song

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[12]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 10

2023
[13]

How long can context length of open-source LLMs truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source LLMs truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

2023
[14]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 2294...

2024
[15]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

2024
[16]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Spinquant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[18]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Process...

2023
[19]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache.Proceedings of Machine Learning Research, 235:32332–32344, 2024

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache.Proceedings of Machine Learning Research, 235:32332–32344, 2024

2024
[20]

Pushing the limits of large language model quantization via the linearity theorem

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Pushing the limits of large language model quantization via the linearity theorem. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

2025
[21]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20275–20321, 2025

2025
[22]

Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights.arXiv preprint arXiv:2509.22944, 2025

Lorenz K Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, and Lukas Cavigelli. Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights.arXiv preprint arXiv:2509.22944, 2025

work page arXiv 2025
[23]

Learning to reason with llms, September 2024

OpenAI. Learning to reason with llms, September 2024. Accessed: May 2026

2024
[24]

Thinking vs

Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar. Thinking vs. doing: Im- proving agent reasoning by scaling test-time interaction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[25]

Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[26]

Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State- Space Architectures from S4 to Mamba

Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, and Subasish Das. From s4 to mamba: A comprehensive survey on structured state space models.arXiv preprint arXiv:2503.18970, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

MultiQA: An empirical investigation of generalization and transfer in reading comprehension

Alon Talmor and Jonathan Berant. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4911–4921, Florence, Italy, July 2019. Association for Computational Linguistics

2019
[28]

Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

2024
[29]

D2O: Dynamic discriminative operations for efficient long-context inference of large language models

Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and Mi Zhang. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[30]

Polarquant: Leveraging polar transformation for key cache quantization and decoding acceleration

Songhao Wu, Ang Lv, xiao feng, Yufei zhang, Xun Zhang, Guojun Yin, Wei Lin, and Rui Yan. Polarquant: Leveraging polar transformation for key cache quantization and decoding acceleration. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[31]

Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643, 2025

Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, et al. Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643, 2025

work page arXiv 2025
[32]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

2025
[33]

Turboquant: Online vector quantization with near-optimal distortion rate

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[34]

American invitational mathematics examination (aime) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

2024
[35]

Cam: Cache merging for memory-efficient LLMs inference

Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. Cam: Cache merging for memory-efficient LLMs inference. InForty-first International Conference on Machine Learning, 2024

2024
[36]

H 2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "Atlas" Wang, and Beidi Chen. H 2O: Heavy-hitter oracle for efficient generative inference of large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in N...

2023
[37]

The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. 12 Wq RoPE(·) H Wk RoPE(·) H Wv H Quantize(·) Quantize(·) Cache Cache Dequantize(·) Dequantize(·) × Softmax(·) × H−1 Wox Figure 7: Arrangement of Hadamard transforms in an attention la...

2023

[1] [1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

2024

[3] [3]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

2024

[4] [4]

Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning

Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofPr...

2025

[5] [5]

PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025

2025

[6] [6]

Quip: 2-bit quanti- zation of large language models with guarantees.Advances in neural information processing systems, 36:4396–4429, 2023

Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quanti- zation of large language models with guarantees.Advances in neural information processing systems, 36:4396–4429, 2023

2023

[7] [7]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

2021

[8] [8]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

KVQuant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

2024

[10] [10]

Needle in a haystack-pressure testing llms

Greg Kamradt. Needle in a haystack-pressure testing llms. https://github.com/gkamradt/ LLMTest_NeedleInAHaystack, 2023

2023

[11] [11]

Lee, Sangdoo Yun, and Hyun Oh Song

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[12] [12]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 10

2023

[13] [13]

How long can context length of open-source LLMs truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source LLMs truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

2023

[14] [14]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 2294...

2024

[15] [15]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

2024

[16] [16]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Spinquant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[18] [18]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Process...

2023

[19] [19]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache.Proceedings of Machine Learning Research, 235:32332–32344, 2024

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache.Proceedings of Machine Learning Research, 235:32332–32344, 2024

2024

[20] [20]

Pushing the limits of large language model quantization via the linearity theorem

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Pushing the limits of large language model quantization via the linearity theorem. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

2025

[21] [21]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20275–20321, 2025

2025

[22] [22]

Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights.arXiv preprint arXiv:2509.22944, 2025

Lorenz K Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, and Lukas Cavigelli. Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights.arXiv preprint arXiv:2509.22944, 2025

work page arXiv 2025

[23] [23]

Learning to reason with llms, September 2024

OpenAI. Learning to reason with llms, September 2024. Accessed: May 2026

2024

[24] [24]

Thinking vs

Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar. Thinking vs. doing: Im- proving agent reasoning by scaling test-time interaction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[25] [25]

Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[26] [26]

Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State- Space Architectures from S4 to Mamba

Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, and Subasish Das. From s4 to mamba: A comprehensive survey on structured state space models.arXiv preprint arXiv:2503.18970, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

MultiQA: An empirical investigation of generalization and transfer in reading comprehension

Alon Talmor and Jonathan Berant. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4911–4921, Florence, Italy, July 2019. Association for Computational Linguistics

2019

[28] [28]

Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

2024

[29] [29]

D2O: Dynamic discriminative operations for efficient long-context inference of large language models

Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and Mi Zhang. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[30] [30]

Polarquant: Leveraging polar transformation for key cache quantization and decoding acceleration

Songhao Wu, Ang Lv, xiao feng, Yufei zhang, Xun Zhang, Guojun Yin, Wei Lin, and Rui Yan. Polarquant: Leveraging polar transformation for key cache quantization and decoding acceleration. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[31] [31]

Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643, 2025

Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, et al. Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643, 2025

work page arXiv 2025

[32] [32]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

2025

[33] [33]

Turboquant: Online vector quantization with near-optimal distortion rate

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[34] [34]

American invitational mathematics examination (aime) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

2024

[35] [35]

Cam: Cache merging for memory-efficient LLMs inference

Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. Cam: Cache merging for memory-efficient LLMs inference. InForty-first International Conference on Machine Learning, 2024

2024

[36] [36]

H 2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "Atlas" Wang, and Beidi Chen. H 2O: Heavy-hitter oracle for efficient generative inference of large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in N...

2023

[37] [37]

The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. 12 Wq RoPE(·) H Wk RoPE(·) H Wv H Quantize(·) Quantize(·) Cache Cache Dequantize(·) Dequantize(·) × Softmax(·) × H−1 Wox Figure 7: Arrangement of Hadamard transforms in an attention la...

2023