pith. machine review for the scientific record.

arxiv: 2604.21231 · v2 · submitted 2026-04-23 · 💻 cs.NI · cs.AI · cs.PF

Recognition: unknown

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

Authors on Pith no claims yet

Pith reviewed 2026-05-08 14:04 UTC · model grok-4.3

classification 💻 cs.NI · cs.AI · cs.PF
keywords on-device LLM inference · KV cache loading · adaptive streaming · cloud-edge collaboration · time-to-first-token · energy efficiency · runtime schedule refinement

The pith

SparKV models per-chunk KV cache costs to decide between cloud streaming and local computation on device, overlapping the paths and refining schedules at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how on-device LLM inference can become faster and less energy-hungry by treating the key-value cache not as one big block but as individual chunks whose loading cost can be predicted. SparKV builds an offline cost model, then at runtime chooses for each chunk whether to pull it over the wireless link or recompute it locally while the two paths run in parallel. It further tweaks the plan on the fly when signal strength or device load changes. A sympathetic reader would care because the prefill stage is currently the main bottleneck that keeps large models from feeling responsive on phones and laptops. If the approach holds, users could run capable models locally with shorter waits for the first token and lower battery drain even on imperfect networks.

Core claim

SparKV is an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. It models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV refines offline-generated schedules at runtime to rebalance communication and computation costs, delivering 1.3x-5.1x lower time-to-first-token and 1.5x-3.3x lower energy per request with negligible effect on output quality.
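
To make the per-chunk decision concrete, here is a minimal sketch of an overhead-aware stream-or-compute choice with overlapping paths. The cost functions, constants, and the greedy assignment policy are illustrative assumptions; per the paper's own formulation (Figure 7), SparKV actually schedules chunks over decision stages with binary variables and refines that schedule at runtime.

```python
# Hedged sketch: greedy per-chunk assignment to the cheaper of two overlapping
# paths (cloud KV streaming vs. local recomputation). Not SparKV's scheduler.
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: int   # token-chunk index
    layer: int      # transformer layer index
    kv_bytes: int   # encoded KV size if streamed from the cloud

def stream_cost_s(chunk: Chunk, bandwidth_bps: float) -> float:
    """Estimated time to pull this chunk's KV over the wireless link."""
    return chunk.kv_bytes * 8 / bandwidth_bps

def compute_cost_s(chunk: Chunk, device_load: float) -> float:
    """Placeholder latency predictor for local recomputation of the chunk.
    SparKV fits such a predictor offline per device; the constants here are
    made up for illustration."""
    base_ms = 40.0 + 2.0 * chunk.layer
    return base_ms * (1.0 + device_load) / 1000.0

def schedule(chunks, bandwidth_bps, device_load):
    """Assign each chunk to whichever path finishes it sooner. Because the two
    queues run in parallel, estimated prefill latency is roughly the max of
    the two path totals rather than their sum."""
    stream_q, compute_q = [], []
    t_stream = t_compute = 0.0
    for c in chunks:
        ts = stream_cost_s(c, bandwidth_bps)
        tc = compute_cost_s(c, device_load)
        if t_stream + ts <= t_compute + tc:
            stream_q.append(c); t_stream += ts
        else:
            compute_q.append(c); t_compute += tc
    return stream_q, compute_q, max(t_stream, t_compute)
```

The point of the sketch is the shape of the decision, not its policy: whichever path is cheaper under the current cost model absorbs the chunk, and the overall latency is bounded by the slower of the two overlapped paths.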

What carries the argument

The per-chunk cost model and runtime decision engine that selects streaming versus local recomputation while overlapping execution and refining schedules on the fly.

If this is right

  • Time-to-first-token drops across multiple LLMs and edge hardware platforms.
  • Energy consumption per request decreases while response quality remains essentially unchanged.
  • Dynamic rebalancing keeps performance stable when network conditions fluctuate.
  • Both communication and local computation are used only where each is cheaper according to the current cost model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same chunk-level cost modeling could be applied to other hybrid cloud-edge workloads such as on-device image or video generation.
  • Combining SparKV with existing model compression or quantization would likely compound the latency and energy gains.
  • Longer-running conversations might benefit from carrying forward refined cost estimates across multiple requests.
  • The approach suggests a general pattern for any system that can choose between fetching precomputed state or regenerating it locally.

Load-bearing premise

Offline-generated cost models for KV chunks stay accurate enough after runtime refinement to produce reliable choices across changing wireless conditions and device loads without adding significant overhead.
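
As a rough illustration of what "runtime refinement without significant overhead" could look like, the sketch below keeps exponential-moving-average estimates of link bandwidth and a correction factor on predicted compute latency; pending chunks would then be re-assigned using the updated estimates. The EMA rule and the smoothing constant are assumptions for illustration, not SparKV's actual refinement mechanism.

```python
# Hedged sketch: cheap online correction of offline cost-model estimates.
# A few arithmetic updates per completed chunk keep refinement overhead
# negligible relative to chunk latencies, which is what the premise requires.
class RuntimeEstimator:
    def __init__(self, bandwidth_bps: float, compute_scale: float = 1.0, alpha: float = 0.3):
        self.bandwidth_bps = bandwidth_bps   # offline estimate, refined online
        self.compute_scale = compute_scale   # multiplier on predicted compute latency
        self.alpha = alpha                   # EMA smoothing factor (assumed)

    def observe_stream(self, bytes_moved: int, seconds: float) -> None:
        """Update the bandwidth estimate from one completed streamed chunk."""
        measured = bytes_moved * 8 / max(seconds, 1e-6)
        self.bandwidth_bps = (1 - self.alpha) * self.bandwidth_bps + self.alpha * measured

    def observe_compute(self, predicted_s: float, measured_s: float) -> None:
        """Update the compute correction factor from one locally computed chunk."""
        ratio = measured_s / max(predicted_s, 1e-6)
        self.compute_scale = (1 - self.alpha) * self.compute_scale + self.alpha * ratio
```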

What would settle it

Measure time-to-first-token and energy on the same devices and models while deliberately varying wireless bandwidth and CPU availability; if the observed speedups fall below 1.3x or the refinement step adds measurable latency, the central claim does not hold.
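
A minimal harness for that test might look like the sketch below. run_prefill is a hypothetical hook into whichever system is under evaluation, and the bandwidth/load grid is illustrative rather than the paper's experimental setup; reporting a mean and standard deviation per grid point is what the referee's reproducibility concern asks for.

```python
# Hedged sketch: sweep bandwidth and CPU load, measuring TTFT for each setting.
import time
import statistics

def measure_ttft(run_prefill, prompt, bandwidth_mbps, cpu_load, repeats=10):
    """Return (mean, stdev) of time-to-first-token over repeated runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_prefill(prompt, bandwidth_mbps=bandwidth_mbps, cpu_load=cpu_load)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Illustrative sweep: if SparKV's mean TTFT is not at least 1.3x below the best
# baseline at some grid point, the central claim fails for that setting.
grid = [(bw, load) for bw in (10, 50, 200) for load in (0.0, 0.5, 0.9)]
```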

Figures

Figures reproduced from arXiv: 2604.21231 by Hongyao Liu, Junyi Wang, Liuqun Zhai, Zhengru Fang.

Figure 1
Figure 1: KV cache loading strategies for TTFT reduction: (a) streaming only; (b) computation only; (c) overhead-aware hybrid loading. view at source ↗
Figure 2
Figure 2: Visualization of attention sparsity across four representative heads from Qwen3-4B (upper row) and Qwen3-VL-8B (lower row). view at source ↗
Figure 3
Figure 3: Chunk-level computation latency of sparse attention for three samples from TriviaQA. view at source ↗
Figure 4
Figure 4: Distribution of entropy and code size of KV cache chunks in Qwen3-4B on TriviaQA and VideoMME. view at source ↗
Figure 6
Figure 6: High-level architecture of SparKV. view at source ↗
Figure 7
Figure 7: Computation dependencies in (a) the first layer, (b) interior layers, and (c) the final layer. view at source ↗
Figure 8
Figure 8: Overhead and prediction error of the proposed predictor and the Roofline baseline for chunk computation latency estimation. view at source ↗
Figure 9
Figure 9: Overall TTFT and response quality across datasets on an RTX 5080 laptop GPU with Llama-3.1-8B. view at source ↗
Figure 10
Figure 10: TTFT and response quality of SparKV and baselines on a Jetson AGX 64GB with Llama-3.1-8B. view at source ↗
Figure 11
Figure 11: TTFT and response quality of SparKV and baselines on HotpotQA using a laptop GPU with Qwen3-4B and Qwen3-14B. view at source ↗
Figure 12
Figure 12: TTFT and response quality of SparKV and baselines on VideoMME using a laptop GPU across VLMs. view at source ↗
Figure 14
Figure 14: Impact of concurrent requests. view at source ↗
Figure 16
Figure 16: Breakdown of streaming and computation overhead in SparKV on TriviaQA using an RTX 5080 laptop GPU. view at source ↗
read the original abstract

Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x–5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x–3.3x, demonstrating its robustness and practicality for real-world on-device deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SparKV, an adaptive framework for on-device LLM inference that decides per-KV-chunk whether to stream from the cloud or compute locally on the device. It overlaps the two paths to hide latency, uses offline-generated cost models refined at runtime to adapt to wireless and resource fluctuations, and reports experimental results showing TTFT reductions of 1.3x–5.1x and energy reductions of 1.5x–3.3x with negligible quality impact across datasets, models, and edge devices.

Significance. If the performance claims hold under rigorous evaluation, the work would be significant for practical on-device LLM deployment, as it directly targets the prefill-stage bottleneck with an overhead-aware hybrid cloud-edge design. The empirical hardware evaluation and focus on runtime adaptation to variable connectivity are strengths that could inform future systems work in this area.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (runtime refinement): the central TTFT and energy claims rest on the assumption that offline cost models, after runtime refinement, remain accurate and low-overhead under fluctuating wireless/edge conditions, yet no quantitative data on refinement overhead, convergence speed, or trace-driven variability is provided; without this, the overlapping-execution benefit cannot be verified.
  2. [§5] §5 (experimental evaluation): the reported speedups (1.3x–5.1x TTFT, 1.5x–3.3x energy) are stated without naming the exact baselines, number of runs, statistical significance tests, or error bars, and without describing the precise measurement methodology for TTFT and energy; this prevents assessment of whether the gains are robust or reproducible.
minor comments (2)
  1. [Abstract] The abstract mentions 'negligible impact on response quality' but does not specify the quality metric or threshold used; a brief clarification would improve readability.
  2. [§3] Notation for cost models and chunk decisions could be introduced earlier with a small diagram to aid readers unfamiliar with KV-cache streaming.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our work. We have carefully addressed each of the major comments in the revised manuscript and provide point-by-point responses below.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (runtime refinement): the central TTFT and energy claims rest on the assumption that offline cost models, after runtime refinement, remain accurate and low-overhead under fluctuating wireless/edge conditions, yet no quantitative data on refinement overhead, convergence speed, or trace-driven variability is provided; without this, the overlapping-execution benefit cannot be verified.

    Authors: We appreciate this comment. While the runtime refinement process is outlined in Section 4, including its use of recent measurements to adjust schedules, we agree that additional quantitative evidence on the overhead, convergence speed, and performance under trace-driven variability would strengthen the claims. In the revised manuscript, we have added a new analysis subsection with quantitative data from experiments and trace-driven simulations. This includes measurements showing low refinement overhead, rapid convergence, and maintained accuracy under varying wireless conditions. These results support the effectiveness of the overlapping execution. We have also updated the abstract to reflect these findings. revision: yes

  2. Referee: [§5] §5 (experimental evaluation): the reported speedups (1.3x–5.1x TTFT, 1.5x–3.3x energy) are stated without naming the exact baselines, number of runs, statistical significance tests, or error bars, and without describing the precise measurement methodology for TTFT and energy; this prevents assessment of whether the gains are robust or reproducible.

    Authors: We agree that more details are necessary for assessing robustness and reproducibility. The revised Section 5 now explicitly identifies the baselines (full on-device computation, full cloud KV streaming, and a non-adaptive hybrid approach). All experiments are conducted over 10 runs, with results reported as means accompanied by standard deviation error bars. We have included statistical significance testing using paired t-tests. Additionally, we have provided a detailed description of the TTFT measurement (using device timers from prompt submission to first token generation) and energy measurement (using on-device power monitoring APIs calibrated with external equipment). These revisions ensure the experimental claims are fully supported and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems design without derivation chain

full rationale

The paper describes an adaptive KV loading framework (SparKV) that uses offline cost models refined at runtime, with performance claims (TTFT and energy reductions) supported solely by hardware experiments across datasets, models, and devices. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or description. The contribution is a practical systems implementation and evaluation rather than a claimed derivation that could reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a systems and empirical contribution with no explicit mathematical axioms, free parameters, or invented physical entities; all claims rest on the engineering design and reported measurements.

pith-pipeline@v0.9.0 · 5482 in / 1146 out tokens · 40835 ms · 2026-05-08T14:04:28.362354+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 21 canonical work pages · 11 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    M. L. Team, “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  2. [2]

    GPT-4 Technical Report

    OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. K. Aleman, D. Almeida, J. Altenschmidt, S. Altman,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team and Google, “Gemini: A family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

  4. [4]

    Qwen3 Technical Report

    Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  5. [5]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P.-A. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,”arXiv preprint arXiv:2310.06825, 2023

  6. [6]

    Fast on-device llm inference with npus,

    D. Xu, H. Zhang, L. Yang, R. Liu, G. Huang, M. Xu, and X. Liu, “Fast on-device llm inference with npus,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 445–462

  7. [7]

    Aif: Accelerating on-device llm inference using in-flash processing,

    J. Lee, H. Kim, S. Oh, M. Chun, M. Kim, and J. Kim, “Aif: Accelerating on-device llm inference using in-flash processing,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 529–543

  8. [8]

    Edgellm: Fast on-device llm inference with speculative decoding,

    D. Xu, W. Yin, H. Zhang, X. Jin, Y . Zhang, S. Wei, M. Xu, and X. Liu, “Edgellm: Fast on-device llm inference with speculative decoding,” IEEE Transactions on Mobile Computing, 2024

  9. [9]

    Edgeshard: Efficient llm inference via collaborative edge computing,

    M. Zhang, X. Shen, J. Cao, Z. Cui, and S. Jiang, “Edgeshard: Efficient llm inference via collaborative edge computing,”IEEE Internet of Things Journal, vol. 12, no. 10, pp. 13 119–13 131, 2025

  10. [10]

    Edge-llm: A collaborative framework for large language model serving in edge computing,

    F. Cai, D. Yuan, Z. Yang, and L. Cui, “Edge-llm: A collaborative framework for large language model serving in edge computing,” in 2024 IEEE International Conference on Web Services (ICWS). IEEE, 2024, pp. 799–809

  11. [11]

    Switchable and dual-tunable multilayered terahertz absorber based on patterned graphene and vanadium dioxide,

    H. Liu, P. Wang, J. Wu, X. Yan, X. Yuan, Y . Zhang, and X. Zhang, “Switchable and dual-tunable multilayered terahertz absorber based on patterned graphene and vanadium dioxide,”Micromachines, vol. 12, no. 6, p. 619, 2021

  12. [12]

    Research on terahertz band electromagnetic characteristics of propagation and scattering in the cold magnetized plasma medium,

    H.-y. Liu and Y . Chao, “Research on terahertz band electromagnetic characteristics of propagation and scattering in the cold magnetized plasma medium,”Optik, vol. 217, p. 164905, 2020

  13. [13]

    Cachegen: Kv cache compression and streaming for fast large language model serving,

    Y . Liu, H. Li, Y . Cheng, S. Ray, Y . Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan,et al., “Cachegen: Kv cache compression and streaming for fast large language model serving,” inProceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 38–56

  14. [14]

    Efficient memory management for large language model serving with pagedattention,

    W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626

  15. [15]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,”arXiv preprint arXiv:2307.08691, 2023

  16. [16]

    Openclaw: Personal ai assistant,

    P. Steinberger, “Openclaw: Personal ai assistant,” https://openclaw.ai/

  17. [17]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for kv cache,” arXiv preprint arXiv:2402.02750, 2024

  18. [18]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models,

    Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al., "H2O: Heavy-hitter oracle for efficient generative inference of large language models," Advances in Neural Information Processing Systems, vol. 36, pp. 34661–34710, 2023

  19. [19]

    LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

    H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, "LLMLingua: Compressing prompts for accelerated inference of large language models," arXiv preprint arXiv:2310.05736, 2023

  20. [20]

    InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

    W. Lee, J. Lee, J. Seo, and J. Sim, "InfiniGen: Efficient generative inference of large language models with dynamic KV cache management," in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 155–172

  21. [21]

    IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference

    W. Chen, S. He, H. Qu, R. Zhang, S. Yang, P. Chen, Y. Zheng, B. Huai, and G. Chen, "IMPRESS: An importance-informed multi-tier prefix KV storage system for large language model inference," in 23rd USENIX Conference on File and Storage Technologies (FAST 25), ...

  22. [22]

    SpargeAttention: Accurate and Training-Free Sparse Attention Accelerating Any Model Inference

    J. Zhang, C. Xiang, H. Huang, H. Xi, J. Zhu, J. Chen, et al., "SpargeAttention: Accurate and training-free sparse attention accelerating any model inference," in Forty-second International Conference on Machine Learning, 2025

  23. [23]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention,

    H. Jiang, Y . Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y . Lin,et al., “Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention,”Advances in Neural Information Processing Systems, vol. 37, pp. 52 481–52 515, 2024

  24. [24]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, "GQA: Training generalized multi-query transformer models from multi-head checkpoints," arXiv preprint arXiv:2305.13245, 2023

  25. [25]

    Compute or load kv cache? why not both?

    S. Jin, X. Liu, Q. Zhang, and Z. M. Mao, “Compute or load kv cache? why not both?”arXiv preprint arXiv:2410.03065, 2024

  26. [26]

    AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, "AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration," Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024

  27. [27]

    Llm as a system service on mobile devices,

    W. Yin, M. Xu, Y . Li, and X. Liu, “Llm as a system service on mobile devices,”arXiv preprint arXiv:2403.11805, 2024

  28. [28]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  29. [29]

    XAttention: Block Sparse Attention with Antidiagonal Scoring

    R. Xu, G. Xiao, H. Huang, J. Guo, and S. Han, "XAttention: Block sparse attention with antidiagonal scoring," arXiv preprint arXiv:2503.16428, 2025

  30. [30]

    SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-Thread INT4 Quantization

    J. Zhang, H. Huang, P. Zhang, J. Wei, J. Zhu, and J. Chen, "SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization," arXiv preprint arXiv:2411.10958, 2024

  31. [31]

    Alibaba cloud,

    Alibaba, “Alibaba cloud,” https://www.alibabacloud.com, 2025

  32. [32]

    Llama-3.1-8b,

    Meta AI Team, “Llama-3.1-8b,” https://huggingface.co/meta-llama/ Llama-3.1-8B, 2024

  33. [33]

    Transformers: State-of-the-Art Natural Language Processing

    T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., "Transformers: State-of-the-art natural language processing," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45

  34. [34]

    llama.cpp,

    “llama.cpp,” https://github.com/ggml-org/llama.cpp, 2026

  35. [35]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551, 2017

  36. [36]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,”arXiv preprint arXiv:1809.09600, 2018

  37. [37]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,

    C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al., "Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24108–24118

  38. [38]

    Dynamic huffman coding,

    D. E. Knuth, “Dynamic huffman coding,”Journal of algorithms, vol. 6, no. 2, pp. 163–180, 1985

  39. [39]

    Gurobi optimizer reference manual, version 11.0,

    Gurobi Optimization, LLC, “Gurobi optimizer reference manual, version 11.0,” https://www.gurobi.com, 2024

  40. [40]

    Roofline: an insightful visual performance model for multicore architectures,

    S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009

  41. [41]

    Multilayer perceptron and neural networks,

    M.-C. Popescu, V . E. Balas, L. Perescu-Popescu, and N. Mastorakis, “Multilayer perceptron and neural networks,”WSEAS Transactions on Circuits and Systems, vol. 8, no. 7, pp. 579–588, 2009

  42. [42]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu,et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24 185–24 198

  43. [43]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    T. Liu, C. Xu, and J. McAuley, "RepoBench: Benchmarking repository-level code auto-completion systems," arXiv preprint arXiv:2306.03091, 2023

  44. [44]

    How long can open-source LLMs truly promise on context length?

    D. Li, R. Shao, A. Xie, Y . Sheng, L. Zheng, J. E. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “How long can open-source LLMs truly promise on context length?” https://lmsys.org/blog/2023-06-29-longchat, Jun 2023

  45. [45]

    Efficient attentions for long document summarization,

    L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang, “Efficient attentions for long document summarization,”arXiv preprint arXiv:2104.02112, 2021

  46. [46]

    The narrativeqa reading comprehension challenge,

    T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, "The NarrativeQA reading comprehension challenge," Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018

  47. [47]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou,et al., “Longbench: A bilingual, multitask benchmark for long context understanding,”arXiv preprint arXiv:2308.14508, 2023

  48. [48]

    LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks

    Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al., "LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks," arXiv preprint arXiv:2412.15204, 2024

  49. [49]

    A survey of wi-fi 6: Technologies, advances, and challenges,

    E. Mozaffariahrar, F. Theoleyre, and M. Menth, “A survey of wi-fi 6: Technologies, advances, and challenges,”Future Internet, vol. 14, no. 10, p. 293, 2022

  50. [50]

    Industrial Internet of Things with Large Language Models (LLMs): An Intelligence-Based Reinforcement Learning Approach

    Y. Ren, H. Zhang, F. R. Yu, W. Li, P. Zhao, and Y. He, "Industrial internet of things with large language models (LLMs): an intelligence-based reinforcement learning approach," IEEE Transactions on Mobile Computing, 2024

  51. [51]

    Next-Gen Service Function Chain Deployment: Combining Multi-Objective Optimization with AI Large Language Models

    Y. Li, Q. Zhang, H. Yao, R. Gao, X. Xin, and M. Guizani, "Next-gen service function chain deployment: Combining multi-objective optimization with AI large language models," IEEE Network, 2025

  52. [52]

    xKV: Cross-Layer SVD for KV-Cache Compression

    C.-C. Chang, C.-Y. Lin, Y. Akhauri, W.-C. Lin, K.-C. Wu, L. Ceze, and M. S. Abdelfattah, "xKV: Cross-layer SVD for KV-cache compression," arXiv preprint arXiv:2503.18893, 2025

  53. [53]

    LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

    R. Zhang, K. Wang, L. Liu, S. Wang, H. Cheng, C. Zhang, and Y. Shen, "LoRC: Low-rank compression for LLMs KV cache with a progressive compression strategy," arXiv preprint arXiv:2410.03111, 2024

  54. [54]

    Cacheblend: Fast large language model serving for rag with cached knowledge fusion,

    J. Yao, H. Li, Y . Liu, S. Ray, Y . Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang, “Cacheblend: Fast large language model serving for rag with cached knowledge fusion,” inProceedings of the Twentieth European Conference on Computer Systems, 2025, pp. 94–109