pith. machine review for the scientific record.

arxiv: 2604.04722 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords KV-cache quantization · adaptive quantization · on-device LLMs · token importance · variable precision · decoding latency · lightweight models · inference optimization

The pith

A learned controller assigns variable bit widths to KV-cache tokens from lightweight token features, cutting decoding latency by roughly 18% while staying within 0.3 points of FP16 accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fixed bit-width KV-cache quantization wastes precision on low-impact tokens and over-compresses important ones, degrading the accuracy-latency tradeoff for on-device LLMs. Instead, a small controller extracts simple per-token signals such as frequency, quality score, attention variance, and entropy, then picks a precision from 2-bit, 4-bit, 8-bit, or FP16 for each cache entry during decoding. This variable allocation follows the same logic as Huffman coding but is learned from data, so memory and bandwidth drop without the accuracy penalty seen in static schemes. Readers care because the KV cache dominates memory use and generation speed once context grows, directly limiting what small models can run locally on phones and embedded hardware.
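To make the mechanism concrete, here is a minimal sketch of a per-token decision applied during decoding. The feature values, the rule-based stand-in controller, and the fake-quantization routine are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def quantize_dequantize(x, bits):
        # Uniform quantization of a vector to `bits` bits, then dequantization,
        # so the rest of the pipeline keeps working in floating point.
        # bits == 16 stands in for keeping FP16.
        if bits >= 16:
            return x.astype(np.float16).astype(np.float32)
        levels = 2 ** bits - 1
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((x - lo) / scale)
        return (q * scale + lo).astype(np.float32)

    def adaptive_kv_step(key, value, features, controller):
        # One decoding-time decision: the controller maps this token's features
        # to a bit width, which is applied to its key and value vectors.
        bits = controller(features)
        return quantize_dequantize(key, bits), quantize_dequantize(value, bits), bits

    # Toy usage with a hypothetical rule-based stand-in for the learned controller.
    rng = np.random.default_rng(0)
    key, value = rng.normal(size=64), rng.normal(size=64)
    features = {"frequency": 0.01, "quality_score": 0.9, "attention_variance": 0.3, "entropy": 2.1}
    stand_in = lambda f: 8 if f["attention_variance"] > 0.2 else 4
    k_hat, v_hat, bits = adaptive_kv_step(key, value, features, stand_in)
    print(bits, float(np.abs(key - k_hat).max()))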

Core claim

A compact data-driven controller maps lightweight token-level features to dynamic KV-cache bit-width choices from the set {2-bit, 4-bit, 8-bit, FP16}, producing lower memory footprint and decoding latency than static quantization while preserving accuracy competitive with FP16 on commonsense reasoning benchmarks for SmolLM models ranging from 135M to 1.7B parameters.

What carries the argument

The compact data-driven controller that receives token frequency, quality score, attention variance, and entropy-based uncertainty and outputs a per-token bit-width decision during autoregressive decoding.
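As a rough picture of what such a compact controller could look like, the sketch below frames it as a tiny MLP classifier over the four features. The layer sizes, random weights, and argmax readout are assumptions; the paper's architecture and training objective are not specified here.

    import numpy as np

    BIT_CHOICES = (2, 4, 8, 16)  # 16 stands in for FP16

    class TinyController:
        # A 4 -> 16 -> 4 MLP that scores the four bit-width choices; the hidden
        # size is an assumption, the paper only says the controller is compact.
        def __init__(self, rng):
            self.w1 = rng.normal(scale=0.5, size=(4, 16))
            self.b1 = np.zeros(16)
            self.w2 = rng.normal(scale=0.5, size=(16, 4))
            self.b2 = np.zeros(4)

        def __call__(self, feats):
            # feats: [frequency, quality_score, attention_variance, entropy]
            h = np.maximum(feats @ self.w1 + self.b1, 0.0)  # ReLU
            logits = h @ self.w2 + self.b2
            return BIT_CHOICES[int(np.argmax(logits))]

    controller = TinyController(np.random.default_rng(0))
    print(controller(np.array([0.01, 0.9, 0.3, 2.1])))  # one of 2, 4, 8, 16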

Load-bearing premise

The chosen lightweight token features are sufficient to predict the bit width that keeps accuracy loss small across contexts and tasks.

What would settle it

If, for any of the tested model sizes, the adaptive controller yields no latency reduction and scores at least 1 point below static 4-bit quantization on a held-out benchmark such as MMLU, the claimed benefit is refuted.
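Operationally, that refutation condition could be scripted as a simple check; the metric names and numbers below are hypothetical, not reported results.

    def refuted(adaptive, static4, margin_points=1.0):
        # True if the adaptive controller shows no latency win AND scores at least
        # `margin_points` below static 4-bit accuracy on the held-out benchmark.
        no_latency_win = adaptive["ms_per_token"] >= static4["ms_per_token"]
        accuracy_gap = static4["accuracy"] - adaptive["accuracy"]
        return no_latency_win and accuracy_gap >= margin_points

    # Hypothetical numbers, not results from the paper.
    print(refuted({"ms_per_token": 12.0, "accuracy": 43.0},
                  {"ms_per_token": 12.5, "accuracy": 44.5}))  # False: latency still improves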

Figures

Figures reproduced from arXiv: 2604.04722 by Abolfazl Razi, Gabriel Hillesheim, Niloufar Mehrabi, Patrick Woods, Sayed Pedram Haeri Boroujeni.

Figure 1
Figure 1. Overview of the proposed framework: We introduce a data-driven controller for adaptive KV-cache quantization to address the KV-cache memory bottleneck in on-device LLM inference, where static quantization often degrades reasoning quality. Our method extracts lightweight token-level signals (e.g., token frequency, attention variance, and entropy-based uncertainty) and uses a learned MLP controller to assign…
original abstract

Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
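To see why the KV cache dominates once context grows, note that its size is roughly 2 (key and value) × layers × KV heads × head dimension × context length × bytes per element. The sketch below computes this for a hypothetical small-model shape; the layer and head counts are assumptions, not SmolLM's published configuration.

    def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
        # Two tensors (key and value) per layer, one slot per token.
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

    # Hypothetical small-model shape, for illustration only.
    fp16 = kv_cache_bytes(4096, 32, 3, 64, 2)
    int4 = kv_cache_bytes(4096, 32, 3, 64, 0.5)
    print(fp16 / 2**20, int4 / 2**20)  # MiB at FP16 vs. uniform 4-bit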

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes adaptive KV-cache quantization for on-device LLMs, where a compact learned controller assigns per-token bit-widths from {2,4,8,FP16} based on four lightweight features (token frequency, quality score, attention variance, entropy-based uncertainty). It claims this reduces memory/latency versus static quantization or rule-based baselines while preserving accuracy close to FP16, with experiments on SmolLM-135M/360M/1.7B models across commonsense reasoning benchmarks; a highlighted result is 17.75% lower decoding latency (ms/token) and +7.60 accuracy points on SmolLM-360M/HellaSwag, staying within 0.30 of FP16.

Significance. If the empirical results hold after proper validation, the work would be significant for practical on-device LLM deployment: it demonstrates that a data-driven, low-overhead controller can dynamically allocate KV precision according to token importance, improving the accuracy-latency trade-off over fixed-bit schemes without requiring heavy additional compute. The approach is lightweight enough for edge hardware and generalizes the Huffman-inspired variable allocation idea to KV caches.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central empirical claims (e.g., 17.75% latency reduction and +7.60 accuracy points on SmolLM-360M/HellaSwag, remaining within 0.30 of FP16) are presented without any description of controller training data, loss function, regularization, baseline implementations, statistical tests, or potential data exclusions. This leaves the reported gains only weakly supported and difficult to reproduce or compare.
  2. [Method] Method section (controller description): the claim that the four lightweight features reliably predict the bit-width needed to preserve accuracy rests on an unverified assumption of sufficiency and generalization. No ablation studies, feature-importance analysis, or held-out context/model validation are reported, so it is unclear whether the policy avoids overfitting or simply wastes bits on average relative to a well-tuned static baseline.
minor comments (1)
  1. [Method] The bit-width set {2,4,8,FP16} is introduced without explicit notation or a table summarizing the quantization scheme per precision level.
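For context, the kind of per-level summary the referee asks for usually reduces to asymmetric uniform quantization with integer codes plus a scale and zero point per tensor. The sketch below assumes that scheme and per-tensor granularity; the paper may use a different format.

    import numpy as np

    def quantize(x, bits):
        # Asymmetric uniform quantization: integer codes in [0, 2^bits - 1]
        # plus a per-tensor scale and zero point (granularity is an assumption).
        qmax = 2 ** bits - 1
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero_point = round(-lo / scale)
        codes = np.clip(np.round(x / scale + zero_point), 0, qmax)
        return codes, scale, zero_point

    def dequantize(codes, scale, zero_point):
        return (codes - zero_point) * scale

    x = np.linspace(-1.0, 1.0, 8)
    for bits in (2, 4, 8):
        codes, s, z = quantize(x, bits)
        print(bits, float(np.abs(x - dequantize(codes, s, z)).max()))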

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen reproducibility and empirical validation.

point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central empirical claims (e.g., 17.75% latency reduction and +7.60 accuracy points on SmolLM-360M/HellaSwag, remaining within 0.30 of FP16) are presented without any description of controller training data, loss function, regularization, baseline implementations, statistical tests, or potential data exclusions. This leaves the reported gains only weakly supported and difficult to reproduce or compare.

    Authors: We agree that the current manuscript lacks sufficient detail on these aspects for full reproducibility. In the revised version, we will expand the Experiments section to describe the controller training data, loss function, regularization, baseline implementations, any statistical tests, and data exclusions. This will allow direct comparison and replication of the reported accuracy-latency improvements. revision: yes

  2. Referee: [Method] Method section (controller description): the claim that the four lightweight features reliably predict the bit-width needed to preserve accuracy rests on an unverified assumption of sufficiency and generalization. No ablation studies, feature-importance analysis, or held-out context/model validation are reported, so it is unclear whether the policy avoids overfitting or simply wastes bits on average relative to a well-tuned static baseline.

    Authors: We acknowledge that additional validation is needed to confirm feature sufficiency and generalization. We will add ablation studies on the feature set, feature-importance analysis, and evaluations on held-out contexts and models in the revised manuscript. These will demonstrate that the policy generalizes without overfitting and improves upon well-tuned static baselines. revision: yes
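A leave-one-feature-out ablation of the kind promised here is easy to specify; in the sketch below, train_controller and evaluate are placeholders for the authors' unpublished training and evaluation code, not functions from the paper.

    FEATURES = ["frequency", "quality_score", "attention_variance", "entropy"]

    def leave_one_out_ablation(train_controller, evaluate, train_data, heldout_data):
        # Retrain the controller with each feature dropped and compare held-out accuracy.
        # `train_controller(data, feature_names)` and `evaluate(controller, data)` are
        # placeholders standing in for the authors' pipeline.
        results = {"all_features": evaluate(train_controller(train_data, FEATURES), heldout_data)}
        for feat in FEATURES:
            kept = [f for f in FEATURES if f != feat]
            results["minus_" + feat] = evaluate(train_controller(train_data, kept), heldout_data)
        return results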

Circularity Check

0 steps flagged

No circularity; the learned controller's reported gains are empirical results validated against external baselines rather than consequences of its own definitions.

full rationale

The paper describes a data-driven adaptive KV-cache quantization policy that extracts four token-level features (frequency, quality score, attention variance, entropy-based uncertainty) and maps them via a compact controller to bit-width choices from {2,4,8,FP16}. No equations, derivations, or self-citations are presented that reduce the reported latency reductions (e.g., 17.75% on SmolLM-360M/HellaSwag) or accuracy gains to fitted parameters by construction, self-definition, or load-bearing prior work by the same authors. The controller is explicitly trained on data, with performance validated against static quantization and FP16 baselines on held-out benchmarks; the derivation chain therefore remains self-contained and falsifiable via external experiments rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that a small set of hand-chosen token features can proxy importance for quantization decisions, plus the existence of a trainable controller whose architecture and training objective are unspecified in the abstract. No invented entities are introduced.

free parameters (1)
  • bit-width set
    Discrete choices {2-bit, 4-bit, 8-bit, FP16} are selected, presumably tuned to hardware and accuracy targets.
axioms (1)
  • domain assumption Lightweight token features (frequency, quality score, attention variance, entropy) suffice to estimate importance for bit-width assignment.
    Invoked to justify the controller's input without requiring full model recomputation.

pith-pipeline@v0.9.0 · 5635 in / 1300 out tokens · 49001 ms · 2026-05-10T19:27:09.006353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Inspired by Huffman coding’s principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance... b*(x) = f(I(x))... R(D) = min I(X;X̂) s.t. distortion ≤ D

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

    UNCLEAR: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    Our method extracts lightweight token-level features... feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16}
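The rate-distortion expression excerpted in the first link above is garbled by extraction; its standard form, in classical information-theory notation rather than the paper's own, is

    R(D) = \min_{p(\hat{x}\mid x)\,:\ \mathbb{E}[d(X,\hat{X})] \le D} I(X; \hat{X})

and the excerpt's b*(x) = f(I(x)) reads as: the assigned bit width is some function of a per-token importance score I(x).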

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    QKVShare enables efficient quantized KV-cache handoff for on-device multi-agent LLMs, cutting TTFT versus re-prefill across tested contexts while adaptive quantization stays competitive with uniform baselines on GSM8K.

Reference graph

Works this paper leans on

38 extracted references · 9 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1] Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J Nair, Ilya Soloveychik, and Purushotham Kamath. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024.

  2. [2] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.

  3. [3] Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Mahlagha Fazeli, and Abolfazl Razi. All you need for object detection: From pixels, points, and prompts to next-gen fusion and multimodal llms/vlms in autonomous vehicles. Image and Vision Computing, page 105944, 2026.

  4. [4] Peter Bühlmann and Abraham J Wyner. Variable length markov chains. The Annals of Statistics, 27(2):480–513, 1999.

  5. [5] Wen Cheng, Shichen Dong, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2542–2550, 2025.

  6. [6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

  7. [7] Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.

  8. [8] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024.

  9. [9] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.

  10. [10] LI Haoyang, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, HU Nicole, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on kv cache management. Transactions on Machine Learning Research, 2025.

  11. [11] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. Advances in Neural Information Processing Systems, 37:68287–68307, 2024.

  12. [12] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024.

  13. [13] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.

  14. [14] Jiedong Lang, Zhehao Guo, and Shuyu Huang. A comprehensive study on quantization techniques for large language models. In 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC), pages 224–231. IEEE, 2024.

  15. [15] Xing Li, XING Zeyu, Yiming Li, Linping Qu, Hui-Ling Zhen, Yiwu Yao, Wulong Liu, Sinno Jialin Pan, and Mingxuan Yuan. Kvtuner: Sensitivity-aware layer-wise mixed-precision kv cache quantization for efficient and nearly lossless llm inference. In Forty-second International Conference on Machine Learning.

  16. [16] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.

  17. [17] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024.

  18. [18] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems, 37:139997–140031, 2024.

  19. [19] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024.

  20. [20] Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, et al. Bluelm-v-3b: Algorithm and system co-design for multimodal large language models on mobile devices. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4145–4155, 2025.

  21. [21] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.

  22. [22] Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. In International Conference on Machine Learning, pages 53095–53114. PMLR, 2025.

  23. [23] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

  24. [24] Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4246–4255, 2025.

  25. [25] Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Surkov Nikita, Ivan Ermakov, and Dan Alistarh. Cache me if you must: Adaptive key-value quantization for large language models. In International Conference on Machine Learning, pages 55451–55473. PMLR, 2025.

  26. [26] Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. Moqae: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10810–10820, 2025.

  27. [27] Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, and Jianzong Wang. Cocktail: Chunk-adaptive mixed-precision quantization for long-context llm inference. In 2025 Design, Automation & Test in Europe Conference (DATE), pages 1–

  28. [28] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.

  29. [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  30. [30] Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. Scope: Optimizing key-value cache compression in long-context generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10775–10790, 2025.

  31. [31] Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. Lamini-lm: A diverse herd of distilled models from large-scale instructions. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, 2024.

  32. [32] Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, and Qian Yu. Empowering llms to understand and generate complex vector graphics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19487–19497, 2025.

  33. [33] Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, and Jingya Wang. Seqafford: Sequential 3d affordance reasoning via multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1691–1701, 2025.

  34. [34] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

  35. [35] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  36. [36] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.

  37. [37] Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, and Liang Ding. Dynamickv: Task-aware adaptive kv cache compression for long context llms. arXiv preprint arXiv:2412.14838, 2024.

  38. [38] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.