LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED (2)
Citing papers:
- Graph-Guided Adaptive Channel Elimination for KV Cache Compression
  GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
- Quantizing Whisper-small: How design choices affect ASR performance
  Dynamic int8 quantization via Quanto on Whisper-small reduces size by 57% and improves WER on LibriSpeech test sets compared to the unquantized baseline.