pith. machine review for the scientific record.

arxiv: 2605.05699 · v1 · submitted 2026-05-07 · 💻 cs.PF · cs.AI

Recognition: unknown

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

Mohamed Amine Bergach

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:11 UTC · model grok-4.3

classification 💻 cs.PF cs.AI
keywords KV cache · int4 quantization · Apple Silicon · Metal kernel · LLM inference · memory compression · fast Fourier transform · perplexity

The pith

On Apple Silicon, a fused int4 KV cache kernel runs faster than fp16 while compressing memory by 3× and keeping quality intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that what is normally a quality-versus-latency trade-off in KV cache quantization becomes a net win on unified-memory hardware. A single Metal kernel fuses rotation via sign-randomized FFT, per-channel scaling, per-group absolute-max quantization, and int4 nibble packing, so that the added compute costs less than the bandwidth saved by moving one-third as much data. On Gemma-3 1B this produces 3 to 8 percent lower per-token latency across 256 to 4096 token prefixes; on Qwen2.5-1.5B it yields small speedups at short context and sharply reduces a severe per-token quality collapse. The result is lower persistent memory use with perplexity changes that range from near zero to modest.

Core claim

A single fused Metal kernel that performs sign-randomized FFT rotation, applies per-channel lambda scaling, computes per-group absolute-max quantization, and packs the result into int4 nibbles executes faster than an fp16 KV cache on Apple Silicon. It delivers 3× compression of persistent KV memory while holding generation quality to within ΔPPL = 0.000 on short-prompt Qwen and +3.6 hook ΔPPL on Gemma, and it reduces Qwen's extreme 4-bit per-token degradation from +7975 to +638.6 PPL.

What carries the argument

The fused Metal kernel that combines sign-randomized FFT rotation, per-channel lambda scaling, per-group abs-max quantization, and int4 nibble packing to keep all work inside one kernel launch and exploit unified-memory bandwidth savings.
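As a concrete reading of that pipeline, the sketch below walks the same four stages in plain numpy. It is an illustration, not the paper's kernel: a sign-randomized discrete Hartley transform (real-valued, orthogonal, and its own inverse) stands in for the fused FFT rotation, λ is fixed at 1 where the paper learns it per channel, and the group size of 32 is an assumed value.

```python
import numpy as np

def srft_rotate(x, signs):
    """Sign-randomized orthogonal rotation (stand-in for the paper's SRFT).

    The normalized discrete Hartley transform is real, orthogonal, and
    self-inverse, so the same call undoes the rotation at read time.
    """
    d = x.shape[-1]
    f = np.fft.fft(x * signs, axis=-1)
    return (f.real - f.imag) / np.sqrt(d)

def quantize_int4(v, lam, group=32):
    """Per-channel lambda scaling + per-group abs-max int4 + nibble pack."""
    g = (v / lam).reshape(-1, group)
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    nib = (q & 0x0F).astype(np.uint8)                 # low 4 bits of each value
    return nib[:, 0::2] | (nib[:, 1::2] << 4), scale  # two values per byte

def dequantize_int4(packed, scale, lam, group=32):
    lo = (packed & 0x0F).astype(np.int16)
    hi = (packed >> 4).astype(np.int16)
    lo = np.where(lo > 7, lo - 16, lo)                # sign-extend the nibbles
    hi = np.where(hi > 7, hi - 16, hi)
    g = np.empty((packed.shape[0], group))
    g[:, 0::2], g[:, 1::2] = lo, hi
    return (g * scale).reshape(-1) * lam

# Round trip for one D=128 cache vector.
rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)
lam = np.ones(d)                                      # paper learns this per channel
x = rng.standard_normal(d)
packed, scale = quantize_int4(srft_rotate(x, signs), lam)
x_hat = srft_rotate(dequantize_int4(packed, scale, lam), np.ones(d)) * signs
print(f"max abs error: {np.abs(x - x_hat).max():.3f}")  # small quantization error
```

The round trip makes the claim's structure visible: quantization error is the only loss, since the rotation and scaling steps are exactly invertible.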

Load-bearing premise

The fixed kernel overhead of roughly 25 nanoseconds per vector stays smaller than the time saved by moving three times less data through memory for the model dimensions and context lengths tested.
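A back-of-envelope check of that premise, using the paper's 25 ns/vec overhead and 3× compression ratio; the 100 GB/s bandwidth figure and the amortize-over-reads framing are our assumptions, not the paper's accounting:

```python
D = 128                       # head dimension (paper's kernel config)
fp16_bytes = 2 * D            # 256 B per cached vector
int4_bytes = fp16_bytes / 3   # ~3x compression including scales (paper)
bandwidth = 100e9             # assumed unified-memory bandwidth, B/s
overhead_ns = 25.0            # fused-kernel cost per vector (paper)

# The write-time overhead is paid once; the saving recurs on every
# later decode step that re-reads the vector.
saved_per_read_ns = (fp16_bytes - int4_bytes) / bandwidth * 1e9   # ~1.7 ns
print(f"break-even after ~{overhead_ns / saved_per_read_ns:.0f} re-reads")
```

On these assumed numbers the kernel pays for itself after roughly fifteen re-reads of a vector, and a higher-bandwidth part shrinks the per-read saving, which is exactly the probe proposed next.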

What would settle it

Measure end-to-end token generation time and perplexity on the same models but at contexts long enough that the relative overhead of the kernel exceeds the bandwidth reduction, or on hardware whose memory bandwidth is substantially higher.

Figures

Figures reproduced from arXiv: 2605.05699 by Mohamed Amine Bergach.

Figure 1: Quantization is free on Apple Silicon: int4 KV cache outpaces fp16 at every tested context length on Gemma-3 1B, and at short context on Qwen2.5-1.5B. (a) End-to-end model.generate latency vs fp16 baseline. Negative values mean int4 is faster: −3 to −8% on Gemma across 256 to 4096 prefix; −0.7 to −2.6% on Qwen up to 1K. (b) Persistent memory ratio: 3–3.3× on full-attention layers (Qwen, Gemma full-attenti…
Figure 2: ΔPPL vs KV-cache bit width on SmolLM2-{135M, 360M, 1.7B} with per-token scaling. Error bars are ±1 seed standard deviation (three seeds for 135M/360M; one seed for 1.7B). SRHT and SRFT curves lie within each other's error bars at every bit width; both cut identity degradation 5–6× at 4-bit and ~15× at 3-bit. Robust across GQA (135M, 360M) and MHA (1.7B). Quantization. Uniform symmetric quantization at bi…
Figure 3: End-to-end decode latency (ms/token, lower is better) for 256-token prompt + 128 new tokens, averaged over five iterations. Eager-mode SRFT pays a consistent 12–17% dispatch tax vs SRHT because the PyTorch/MPS pipeline dispatches four separate kernels per layer per step. The fused Metal kernel closes that gap: on SmolLM2-360M it matches SRHT within 1%, on SmolLM2-1.7B it is 5% faster than SRHT because each…
Figure 4: Metal kernel throughput as ns per vector vs batch size, for D=64 / D=128 and int8 / int4 output. Int4 and int8 variants track each other to within 3% because the butterfly FLOPs dominate the runtime; halving the store bandwidth only shaves the bandwidth-bound tail at very large batches. • D=128, int4: 20.1 ns/vec, 223 GFLOPS, 28.9 GB/s. The D=128 kernel achieves higher GFLOPS than D=64 because it saturates…
Figure 5: Left: raw GFLOPS for a length-8 complex DFT expressed as a 16×16 real matrix multiply, col-batched vs row-batched, compared to the AMX FP32 single-core peak (~350 GFLOPS). Right: same configurations in ns per DFT, vs Apple's tuned vDSP fft_zop. Col-batched sgemm beats vDSP by 3.4× at B=1024 but loses at B ≥ 262K when the output no longer fits in L2. The learned λ fuses into the Metal kernel as an in-re…
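Figure 5's left panel rests on the identity that a length-8 complex DFT is a 16×16 real matrix multiply. A minimal numpy check of that identity, using the standard real-complex block embedding (our construction; the paper's exact layout may differ):

```python
import numpy as np

n = 8
F = np.fft.fft(np.eye(n))            # n x n complex DFT matrix (symmetric)

# Real 2n x 2n block embedding of the complex matvec y = F x:
# [Re(y); Im(y)] = [[Re(F), -Im(F)], [Im(F), Re(F)]] @ [Re(x); Im(x)]
M = np.block([[F.real, -F.imag],
              [F.imag,  F.real]])    # 16 x 16 real matrix

rng = np.random.default_rng(0)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
yr = M @ np.concatenate([x.real, x.imag])
y = yr[:n] + 1j * yr[n:]
assert np.allclose(y, np.fft.fft(x))  # the matmul reproduces the DFT
```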
read the original abstract

KV-cache quantization is framed as a quality–latency trade-off. We show it is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max + int4 nibble pack), exposed as a HuggingFace Cache subclass, runs faster than fp16 across 256–4096-token prefixes on Gemma-3 1B (−3 to −8% ms/tok) and at short context on Qwen2.5-1.5B (−0.7 to −2.6% through 1K), with 3× persistent memory compression and quality preserved (ΔPPL = 0.000 Qwen short-prompt; +3.6 hook ΔPPL Gemma). The kernel's ~25 ns/vec overhead is below the bandwidth savings from 3× compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe (ΔPPL = +7975 → +638.6, a 12.5× reduction) at 182 GFLOPS / D=128. Supporting findings: SRFT and SRHT are statistically indistinguishable for KV quality (we pick SRFT for mixed-radix and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning R+λ without SRFT lowers calibration MSE 84.9% vs 50.3% but yields worse PPL); Householder rotations with k = d/2 reflectors are effectively lossless at d = 256.
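The abstract's integration surface is a HuggingFace Cache subclass. As a rough sketch of that surface, the plain class below mirrors the update(key_states, value_states, layer_idx, cache_kwargs) signature of transformers' Cache, quantizing on write and dequantizing on read. The class name and internals are ours, not the paper's, and the torch helpers stand in for the fused kernel (no rotation, no learned λ, grouping along the flattened tensor for brevity).

```python
import torch

def quantize_int4(v, group=32):
    """Per-group abs-max int4 + nibble pack (stand-in for the fused kernel)."""
    g = v.float().reshape(-1, group)
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
    nib = (q & 0x0F).to(torch.uint8)
    return nib[:, 0::2] | (nib[:, 1::2] << 4), scale   # two values per byte

def dequantize_int4(packed, scale, shape, group=32):
    lo = (packed & 0x0F).to(torch.int16)
    hi = (packed >> 4).to(torch.int16)
    lo = torch.where(lo > 7, lo - 16, lo)   # sign-extend the nibbles
    hi = torch.where(hi > 7, hi - 16, hi)
    g = torch.empty(packed.shape[0], group)
    g[:, 0::2] = lo.float()
    g[:, 1::2] = hi.float()
    return (g * scale).reshape(shape)

class Int4KVCache:
    """Quantize-on-write KV store mirroring the HF Cache.update interface."""
    def __init__(self):
        self.store = {}   # layer_idx -> {"k": [(packed, scale, shape)], "v": [...]}

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        layer = self.store.setdefault(layer_idx, {"k": [], "v": []})
        layer["k"].append((*quantize_int4(key_states), key_states.shape))
        layer["v"].append((*quantize_int4(value_states), value_states.shape))
        # Dequantize the whole layer for this step's attention; this sketch
        # rematerializes fp16 and is for shape/flow illustration only.
        keys = torch.cat([dequantize_int4(*e) for e in layer["k"]], dim=-2)
        values = torch.cat([dequantize_int4(*e) for e in layer["v"]], dim=-2)
        return keys, values

cache = Int4KVCache()
k = torch.randn(1, 8, 4, 128)   # (batch, heads, seq, head_dim)
v = torch.randn(1, 8, 4, 128)
keys, values = cache.update(k, v, layer_idx=0)
print(keys.shape, (keys - k).abs().max())   # shape preserved, small error
```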

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that on Apple Silicon's unified memory, KV-cache quantization can invert the usual quality–latency trade-off. A single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max + int4 nibble packing), exposed via a Hugging Face Cache subclass, reportedly runs faster than an fp16 KV cache (−3 to −8% ms/tok on Gemma-3 1B for 256–4096-token prefixes; −0.7 to −2.6% on Qwen2.5-1.5B at ≤1K tokens) while delivering 3× persistent memory compression and near-lossless quality (ΔPPL = 0.000 on Qwen short prompts; +3.6 hook ΔPPL on Gemma). The kernel's ~25 ns/vec overhead is asserted to lie below the bandwidth savings from 3× compression. Supporting results include statistical equivalence of SRFT and SRHT, a regularization role for the fixed random SRFT base, and near-lossless Householder rotations at k = d/2.

Significance. If the timing and quality results prove robust, the work would be significant for systems-level LLM inference on Apple Silicon and similar unified-memory platforms. It supplies concrete evidence that a carefully fused, hardware-specific quantization kernel can turn a presumed trade-off into a net win, with immediate practical value for memory-constrained deployments and a clear path for reproduction via the provided Hugging Face integration.

major comments (2)
  1. [Abstract] The central claim that the fused kernel's ~25 ns/vec overhead remains below 3× bandwidth savings (and therefore produces net speedups) is load-bearing yet rests solely on direct measurements for the specific context lengths, batch sizes, and models tested. No timing breakdown, roofline analysis, or sensitivity study across wider context ranges or batch sizes is supplied to show the overhead stays sub-linear outside the reported 256–4096 token window.
  2. [Abstract] The reported speedups (−3 to −8% ms/tok on Gemma-3 1B) and perplexity deltas lack error bars, raw timing data, or statistical significance tests. Without these, it is impossible to determine whether the observed gains exceed measurement noise or hardware variability on the tested Apple Silicon devices.
minor comments (2)
  1. [Abstract] The abstract states both SRFT and SRHT are “statistically indistinguishable” for KV quality; the manuscript should report the exact statistical test and p-value used to reach this conclusion.
  2. [Abstract] The learned-rotation ablation reports calibration MSE values (84.9 % vs 50.3 %) but does not clarify whether these are relative or absolute reductions; a short clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional analysis would strengthen the manuscript. We address each major comment below and will incorporate the suggested improvements in the revised version.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the fused kernel's ~25 ns/vec overhead remains below 3× bandwidth savings (and therefore produces net speedups) is load-bearing yet rests solely on direct measurements for the specific context lengths, batch sizes, and models tested. No timing breakdown, roofline analysis, or sensitivity study across wider context ranges or batch sizes is supplied to show the overhead stays sub-linear outside the reported 256–4096 token window.

    Authors: We agree that a roofline analysis and broader sensitivity study would provide stronger support for the general claim. In the revised manuscript we will add a roofline plot for the Apple Silicon GPU that positions the fused kernel relative to the memory-bandwidth roof, together with a per-component timing breakdown that isolates the ~25 ns/vec quantization overhead from the bandwidth reduction. We will also extend the latency measurements to context lengths up to 8192 tokens and batch sizes 1–8, confirming that the overhead remains below the 3× compression benefit across this wider operating range. revision: yes

  2. Referee: [Abstract] The reported speedups (−3 to −8% ms/tok on Gemma-3 1B) and perplexity deltas lack error bars, raw timing data, or statistical significance tests. Without these, it is impossible to determine whether the observed gains exceed measurement noise or hardware variability on the tested Apple Silicon devices.

    Authors: We acknowledge that the absence of error bars and statistical tests limits the ability to assess robustness against measurement variability. In the revision we will report means and standard deviations from at least ten independent runs for every latency and perplexity configuration. We will also include paired statistical significance tests (Wilcoxon signed-rank) comparing the int4 and fp16 timings, and we will release the raw timing logs as supplementary material so that readers can reproduce the analysis. revision: yes
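To make the two promised analyses concrete, two small sketches. First, the arithmetic a roofline plot would encode, locating the measured kernel point from Figure 4 (223 GFLOPS at 28.9 GB/s); both peak values below are illustrative assumptions, not the paper's hardware numbers:

```python
peak_flops = 3.5e12   # assumed GPU fp32 peak, FLOP/s (illustrative)
peak_bw = 100e9       # assumed unified-memory bandwidth, B/s (illustrative)

gflops, gbps = 223e9, 28.9e9        # measured kernel point (Figure 4)
ai = gflops / gbps                  # arithmetic intensity, ~7.7 FLOP/byte
attainable = min(peak_flops, ai * peak_bw)   # roofline bound at this AI
print(f"AI = {ai:.1f} FLOP/B; bound = {attainable/1e9:.0f} GFLOPS; "
      f"achieved = {gflops/1e9:.0f} GFLOPS")
```

Second, a minimal form of the promised paired test: Wilcoxon signed-rank on matched int4 vs fp16 per-run latencies. The timing arrays here are synthetic placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
fp16_ms = 10.0 + 0.2 * rng.standard_normal(10)             # ten runs, ms/token
int4_ms = fp16_ms * 0.95 + 0.05 * rng.standard_normal(10)  # ~5% faster

stat, p = wilcoxon(fp16_ms, int4_ms)    # paired, two-sided by default
print(f"W = {stat:.1f}, p = {p:.4f}")   # small p: speedup exceeds run-to-run noise
```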

Circularity Check

0 steps flagged

No circularity: empirical timing and ablation results only

full rationale

The paper reports direct hardware measurements of a fused Metal kernel's latency and quality metrics (ΔPPL) on specific models and context lengths, together with ablation comparisons (SRFT vs SRHT, learned-rotation effects). No derivation chain, equations, or first-principles predictions are presented that could reduce to fitted inputs or self-citations by construction. All claims rest on experimental data rather than any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work appears to rest on standard transformer KV-cache mechanics and empirical timing on specific hardware.

pith-pipeline@v0.9.0 · 5642 in / 1181 out tokens · 69834 ms · 2026-05-08T03:11:08.831406+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 18 canonical work pages · 8 internal anchors

  1. [1] Amir Abbasi et al. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv:2504.19874, 2025.

  2. [2] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.

  3. [3] Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, and Martino Dazzi. KurTail: Kurtosis-based LLM quantization. arXiv:2503.01483, 2025.

  4. [4] Loubna Ben Allal et al. SmolLM2: When Smol goes big – data-centric training of a small language model. arXiv:2502.02737, 2025.

  5. [5] Apple Inc. Accelerate framework documentation. https://developer.apple.com/documentation/accelerate.

  6. [6] Apple Inc. Metal shading language specification. https://developer.apple.com/metal/, 2024.

  7. [7] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. arXiv:2404.00456, 2024.

  8. [8] Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra. Cosmopedia. HuggingFace Datasets, 2024.

  9. [9] Mohamed Amine Bergach. Beating vDSP: A 138 GFLOPS radix-8 Stockham FFT on Apple Silicon via two-tier register-threadgroup memory decomposition. 2026.

  10. [10] Peter Cawley. Apple AMX instruction set (reverse engineered). https://github.com/corsix/amx, 2022.

  11. [11] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. arXiv:2307.13304, 2023.

  12. [12] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.

  13. [13] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv:2208.07339, 2022.

  14. [14] Abhimanyu Dubey et al. The Llama 3 herd of models. arXiv:2407.21783, 2024.

  15. [15] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323, 2023.

  16. [16] Gemma Team. Gemma 3 technical report. arXiv:2503.19786, 2025.

  17. [17] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

  18. [18] Dougall Johnson. Apple AMX notes. https://dougallj.wordpress.com/2021/03/09/apples-mysterious-coprocessor/, 2021.

  19. [19] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv:2306.00978, 2023.

  20. [20] Xiaoran Liu, Siyang He, Qiqi Wang, et al. Beyond homogeneous attention: Memory-efficient LLMs via Fourier-approximated KV cache. arXiv:2506.11886, 2025.

  21. [21] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. arXiv:2405.16406, 2024.

  22. [22] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv:2402.02750, 2024.

  23. [23] Scrya. RotorQuant: KV cache compression via block-diagonal rotation. https://github.com/scrya-com/rotorquant, 2025.

  24. [24] Wenqi Shao, Yuhang Chen, et al. DartQuant: Efficient rotational distribution calibration for LLM quantization. arXiv:2511.04063, 2025.

  25. [25] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. arXiv:2402.04396, 2024.

  26. [26] Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, and Markus Nagel. FPTQuant: Function-preserving transforms for LLM quantization. arXiv:2506.04985, 2025.

  27. [27] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.

  28. [28] Franco Woolfe, Edo Liberty, Vladimir Rokhlin, and Mark Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.

  29. [29] Bingxin Xu, Zhen Dong, Oussama Elachqar, and Yuzhang Shang. ButterflyQuant: Ultra-low-bit LLM quantization through learnable orthogonal butterfly transforms. arXiv:2509.09679, 2025.

  30. [30] An Yang et al. Qwen2 technical report. arXiv:2407.10671, 2024.