pith. machine review for the scientific record.

arxiv: 2604.04722 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords KV-cache quantization · adaptive quantization · on-device LLMs · token importance · variable precision · decoding latency · lightweight models · inference optimization

The pith

A learned controller assigns variable bit widths to KV-cache tokens from lightweight token features, cutting decoding latency by roughly 18% while staying within 0.3 points of FP16 accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fixed bit-width KV-cache quantization wastes precision on low-impact tokens and over-compresses important ones, degrading the accuracy-latency tradeoff for on-device LLMs. Instead, a small controller extracts simple per-token signals such as frequency, quality score, attention variance, and entropy, then picks a precision from 2-bit, 4-bit, 8-bit, or FP16 for each cache entry during decoding. This variable allocation follows the same logic as Huffman coding but is learned from data, so memory and bandwidth drop without the accuracy penalty seen in static schemes. Readers care because the KV cache dominates memory use and generation speed once context grows, directly limiting what small models can run locally on phones and embedded hardware.
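To make the mechanism concrete, here is a minimal sketch of a per-token decision applied during decoding. The feature values, the rule-based stand-in controller, and the fake-quantization routine are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def quantize_dequantize(x, bits):
        # Uniform quantization of a vector to `bits` bits, then dequantization,
        # so the rest of the pipeline keeps working in floating point.
        # bits == 16 stands in for keeping FP16.
        if bits >= 16:
            return x.astype(np.float16).astype(np.float32)
        levels = 2 ** bits - 1
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((x - lo) / scale)
        return (q * scale + lo).astype(np.float32)

    def adaptive_kv_step(key, value, features, controller):
        # One decoding-time decision: the controller maps this token's features
        # to a bit width, which is applied to its key and value vectors.
        bits = controller(features)
        return quantize_dequantize(key, bits), quantize_dequantize(value, bits), bits

    # Toy usage with a hypothetical rule-based stand-in for the learned controller.
    rng = np.random.default_rng(0)
    key, value = rng.normal(size=64), rng.normal(size=64)
    features = {"frequency": 0.01, "quality_score": 0.9, "attention_variance": 0.3, "entropy": 2.1}
    stand_in = lambda f: 8 if f["attention_variance"] > 0.2 else 4
    k_hat, v_hat, bits = adaptive_kv_step(key, value, features, stand_in)
    print(bits, float(np.abs(key - k_hat).max()))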

Core claim

A compact data-driven controller maps lightweight token-level features to dynamic KV-cache bit-width choices from the set {2-bit, 4-bit, 8-bit, FP16}, producing lower memory footprint and decoding latency than static quantization while preserving accuracy competitive with FP16 on commonsense reasoning benchmarks for SmolLM models ranging from 135M to 1.7B parameters.

What carries the argument

The compact data-driven controller that receives token frequency, quality score, attention variance, and entropy-based uncertainty and outputs a per-token bit-width decision during autoregressive decoding.
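As a rough picture of what such a compact controller could look like, the sketch below frames it as a tiny MLP classifier over the four features. The layer sizes, random weights, and argmax readout are assumptions; the paper's architecture and training objective are not specified here.

    import numpy as np

    BIT_CHOICES = (2, 4, 8, 16)  # 16 stands in for FP16

    class TinyController:
        # A 4 -> 16 -> 4 MLP that scores the four bit-width choices; the hidden
        # size is an assumption, the paper only says the controller is compact.
        def __init__(self, rng):
            self.w1 = rng.normal(scale=0.5, size=(4, 16))
            self.b1 = np.zeros(16)
            self.w2 = rng.normal(scale=0.5, size=(16, 4))
            self.b2 = np.zeros(4)

        def __call__(self, feats):
            # feats: [frequency, quality_score, attention_variance, entropy]
            h = np.maximum(feats @ self.w1 + self.b1, 0.0)  # ReLU
            logits = h @ self.w2 + self.b2
            return BIT_CHOICES[int(np.argmax(logits))]

    controller = TinyController(np.random.default_rng(0))
    print(controller(np.array([0.01, 0.9, 0.3, 2.1])))  # one of 2, 4, 8, 16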

Load-bearing premise

The chosen lightweight token features are sufficient to predict the bit width that keeps accuracy loss small across contexts and tasks.

What would settle it

If, for any of the tested model sizes, the adaptive controller yields no latency reduction and scores at least 1 point below static 4-bit quantization on a held-out benchmark such as MMLU, the claimed benefit is refuted.
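Operationally, that refutation condition could be scripted as a simple check; the metric names and numbers below are hypothetical, not reported results.

    def refuted(adaptive, static4, margin_points=1.0):
        # True if the adaptive controller shows no latency win AND scores at least
        # `margin_points` below static 4-bit accuracy on the held-out benchmark.
        no_latency_win = adaptive["ms_per_token"] >= static4["ms_per_token"]
        accuracy_gap = static4["accuracy"] - adaptive["accuracy"]
        return no_latency_win and accuracy_gap >= margin_points

    # Hypothetical numbers, not results from the paper.
    print(refuted({"ms_per_token": 12.0, "accuracy": 43.0},
                  {"ms_per_token": 12.5, "accuracy": 44.5}))  # False: latency still improves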

Figures

Figures reproduced from arXiv: 2604.04722 by Abolfazl Razi, Gabriel Hillesheim, Niloufar Mehrabi, Patrick Woods, Sayed Pedram Haeri Boroujeni.

Figure 1
Figure 1. Overview of the proposed framework: We introduce a data-driven controller for adaptive KV-cache quantization to address the KV-cache memory bottleneck in on-device LLM inference, where static quantization often degrades reasoning quality. Our method extracts lightweight token-level signals (e.g., token frequency, attention variance, and entropy-based uncertainty) and uses a learned MLP controller to assign…
original abstract

Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
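To see why the KV cache dominates once context grows, note that its size is roughly 2 (key and value) × layers × KV heads × head dimension × context length × bytes per element. The sketch below computes this for a hypothetical small-model shape; the layer and head counts are assumptions, not SmolLM's published configuration.

    def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
        # Two tensors (key and value) per layer, one slot per token.
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

    # Hypothetical small-model shape, for illustration only.
    fp16 = kv_cache_bytes(4096, 32, 3, 64, 2)
    int4 = kv_cache_bytes(4096, 32, 3, 64, 0.5)
    print(fp16 / 2**20, int4 / 2**20)  # MiB at FP16 vs. uniform 4-bit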

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes adaptive KV-cache quantization for on-device LLMs, where a compact learned controller assigns per-token bit-widths from {2,4,8,FP16} based on four lightweight features (token frequency, quality score, attention variance, entropy-based uncertainty). It claims this reduces memory/latency versus static quantization or rule-based baselines while preserving accuracy close to FP16, with experiments on SmolLM-135M/360M/1.7B models across commonsense reasoning benchmarks; a highlighted result is 17.75% lower decoding latency (ms/token) and +7.60 accuracy points on SmolLM-360M/HellaSwag, staying within 0.30 of FP16.

Significance. If the empirical results hold after proper validation, the work would be significant for practical on-device LLM deployment: it demonstrates that a data-driven, low-overhead controller can dynamically allocate KV precision according to token importance, improving the accuracy-latency trade-off over fixed-bit schemes without requiring heavy additional compute. The approach is lightweight enough for edge hardware and generalizes the Huffman-inspired variable allocation idea to KV caches.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central empirical claims (e.g., 17.75% latency reduction and +7.60 accuracy points on SmolLM-360M/HellaSwag, remaining within 0.30 of FP16) are presented without any description of controller training data, loss function, regularization, baseline implementations, statistical tests, or potential data exclusions. This leaves the reported gains only weakly supported and difficult to reproduce or compare.
  2. [Method] Method section (controller description): the claim that the four lightweight features reliably predict the bit-width needed to preserve accuracy rests on an unverified assumption of sufficiency and generalization. No ablation studies, feature-importance analysis, or held-out context/model validation are reported, so it is unclear whether the policy avoids overfitting or simply wastes bits on average relative to a well-tuned static baseline.
minor comments (1)
  1. [Method] The bit-width set {2,4,8,FP16} is introduced without explicit notation or a table summarizing the quantization scheme per precision level.
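For context, the kind of per-level summary the referee asks for usually reduces to asymmetric uniform quantization with integer codes plus a scale and zero point per tensor. The sketch below assumes that scheme and per-tensor granularity; the paper may use a different format.

    import numpy as np

    def quantize(x, bits):
        # Asymmetric uniform quantization: integer codes in [0, 2^bits - 1]
        # plus a per-tensor scale and zero point (granularity is an assumption).
        qmax = 2 ** bits - 1
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero_point = round(-lo / scale)
        codes = np.clip(np.round(x / scale + zero_point), 0, qmax)
        return codes, scale, zero_point

    def dequantize(codes, scale, zero_point):
        return (codes - zero_point) * scale

    x = np.linspace(-1.0, 1.0, 8)
    for bits in (2, 4, 8):
        codes, s, z = quantize(x, bits)
        print(bits, float(np.abs(x - dequantize(codes, s, z)).max()))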

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen reproducibility and empirical validation.

point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central empirical claims (e.g., 17.75% latency reduction and +7.60 accuracy points on SmolLM-360M/HellaSwag, remaining within 0.30 of FP16) are presented without any description of controller training data, loss function, regularization, baseline implementations, statistical tests, or potential data exclusions. This leaves the reported gains only weakly supported and difficult to reproduce or compare.

    Authors: We agree that the current manuscript lacks sufficient detail on these aspects for full reproducibility. In the revised version, we will expand the Experiments section to describe the controller training data, loss function, regularization, baseline implementations, any statistical tests, and data exclusions. This will allow direct comparison and replication of the reported accuracy-latency improvements. revision: yes

  2. Referee: [Method] Method section (controller description): the claim that the four lightweight features reliably predict the bit-width needed to preserve accuracy rests on an unverified assumption of sufficiency and generalization. No ablation studies, feature-importance analysis, or held-out context/model validation are reported, so it is unclear whether the policy avoids overfitting or simply wastes bits on average relative to a well-tuned static baseline.

    Authors: We acknowledge that additional validation is needed to confirm feature sufficiency and generalization. We will add ablation studies on the feature set, feature-importance analysis, and evaluations on held-out contexts and models in the revised manuscript. These will demonstrate that the policy generalizes without overfitting and improves upon well-tuned static baselines. revision: yes
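A leave-one-feature-out ablation of the kind promised here is easy to specify; in the sketch below, train_controller and evaluate are placeholders for the authors' unpublished training and evaluation code, not functions from the paper.

    FEATURES = ["frequency", "quality_score", "attention_variance", "entropy"]

    def leave_one_out_ablation(train_controller, evaluate, train_data, heldout_data):
        # Retrain the controller with each feature dropped and compare held-out accuracy.
        # `train_controller(data, feature_names)` and `evaluate(controller, data)` are
        # placeholders standing in for the authors' pipeline.
        results = {"all_features": evaluate(train_controller(train_data, FEATURES), heldout_data)}
        for feat in FEATURES:
            kept = [f for f in FEATURES if f != feat]
            results["minus_" + feat] = evaluate(train_controller(train_data, kept), heldout_data)
        return results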

Circularity Check

0 steps flagged

No circularity; the learned controller's reported gains are empirical results validated against external baselines rather than consequences of its own definitions.

full rationale

The paper describes a data-driven adaptive KV-cache quantization policy that extracts four token-level features (frequency, quality score, attention variance, entropy-based uncertainty) and maps them via a compact controller to bit-width choices from {2,4,8,FP16}. No equations, derivations, or self-citations are presented that reduce the reported latency reductions (e.g., 17.75% on SmolLM-360M/HellaSwag) or accuracy gains to fitted parameters by construction, self-definition, or load-bearing prior work by the same authors. The controller is explicitly trained on data, with performance validated against static quantization and FP16 baselines on held-out benchmarks; the derivation chain therefore remains self-contained and falsifiable via external experiments rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that a small set of hand-chosen token features can proxy importance for quantization decisions, plus the existence of a trainable controller whose architecture and training objective are unspecified in the abstract. No invented entities are introduced.

free parameters (1)
  • bit-width set
    Discrete choices {2-bit, 4-bit, 8-bit, FP16} are selected, presumably tuned to hardware and accuracy targets.
axioms (1)
  • domain assumption Lightweight token features (frequency, quality score, attention variance, entropy) suffice to estimate importance for bit-width assignment.
    Invoked to justify the controller's input without requiring full model recomputation.

pith-pipeline@v0.9.0 · 5635 in / 1300 out tokens · 49001 ms · 2026-05-10T19:27:09.006353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Inspired by Huffman coding’s principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance... b*(x) = f(I(x))... R(D) = min I(X;X̂) s.t. distortion ≤ D

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

    UNCLEAR: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    Our method extracts lightweight token-level features... feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16}
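The rate-distortion expression excerpted in the first link above is garbled by extraction; its standard form, in classical information-theory notation rather than the paper's own, is

    R(D) = \min_{p(\hat{x}\mid x)\,:\ \mathbb{E}[d(X,\hat{X})] \le D} I(X; \hat{X})

and the excerpt's b*(x) = f(I(x)) reads as: the assigned bit width is some function of a per-token importance score I(x).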

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    QKVShare enables efficient quantized KV-cache handoff for on-device multi-agent LLMs, cutting TTFT versus re-prefill across tested contexts while adaptive quantization stays competitive with uniform baselines on GSM8K.

Reference graph

Works this paper leans on

38 extracted references · 9 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1] Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J Nair, Ilya Soloveychik, and Purushotham Kamath. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024.

  2. [2] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.

  3. [3] Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Mahlagha Fazeli, and Abolfazl Razi. All you need for object detection: From pixels, points, and prompts to next-gen fusion and multimodal llms/vlms in autonomous vehicles. Image and Vision Computing, page 105944, 2026.

  4. [4] Peter Bühlmann and Abraham J Wyner. Variable length markov chains. The Annals of Statistics, 27(2):480–513, 1999.

  5. [5] Wen Cheng, Shichen Dong, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2542–2550, 2025.

  6. [6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

  7. [7] Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.

  8. [8] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024.

  9. [9] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.

  10. [10] LI Haoyang, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, HU Nicole, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on kv cache management. Transactions on Machine Learning Research, 2025.

  11. [11] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. Advances in Neural Information Processing Systems, 37:68287–68307, 2024.

  12. [12] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024.

  13. [13] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.

  14. [14] Jiedong Lang, Zhehao Guo, and Shuyu Huang. A comprehensive study on quantization techniques for large language models. In 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC), pages 224–231. IEEE, 2024.

  15. [15] Xing Li, XING Zeyu, Yiming Li, Linping Qu, Hui-Ling Zhen, Yiwu Yao, Wulong Liu, Sinno Jialin Pan, and Mingxuan Yuan. Kvtuner: Sensitivity-aware layer-wise mixed-precision kv cache quantization for efficient and nearly lossless llm inference. In Forty-second International Conference on Machine Learning.

  16. [16] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.

  17. [17] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024.

  18. [18] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems, 37:139997–140031, 2024.

  19. [19] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024.

  20. [20] Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, et al. Bluelm-v-3b: Algorithm and system co-design for multimodal large language models on mobile devices. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4145–4155, 2025.

  21. [21] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.

  22. [22] Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. In International Conference on Machine Learning, pages 53095–53114. PMLR, 2025.

  23. [23] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

  24. [24] Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4246–4255, 2025.

  25. [25] Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Surkov Nikita, Ivan Ermakov, and Dan Alistarh. Cache me if you must: Adaptive key-value quantization for large language models. In International Conference on Machine Learning, pages 55451–55473. PMLR, 2025.

  26. [26] Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. Moqae: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10810–10820, 2025.

  27. [27] Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, and Jianzong Wang. Cocktail: Chunk-adaptive mixed-precision quantization for long-context llm inference. In 2025 Design, Automation & Test in Europe Conference (DATE), pages 1–

  28. [28] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.

  29. [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  30. [30] Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. Scope: Optimizing key-value cache compression in long-context generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10775–10790, 2025.

  31. [31] Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. Lamini-lm: A diverse herd of distilled models from large-scale instructions. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, 2024.

  32. [32] Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, and Qian Yu. Empowering llms to understand and generate complex vector graphics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19487–19497, 2025.

  33. [33] Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, and Jingya Wang. Seqafford: Sequential 3d affordance reasoning via multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1691–1701, 2025.

  34. [34] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

  35. [35] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  36. [36] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.

  37. [37] Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, and Liang Ding. Dynamickv: Task-aware adaptive kv cache compression for long context llms. arXiv preprint arXiv:2412.14838, 2024.

  38. [38] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.