KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.LG 2years
2026 2roles
background 1polarities
background 1representative citing papers
HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.
citing papers explorer
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.
-
HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models
HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.