pith. machine review for the scientific record.

arxiv: 2605.11733 · v1 · submitted 2026-05-12 · 💻 cs.CE · cs.DC

Recognition: no theorem link

Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:13 UTC · model grok-4.3

classification 💻 cs.CE cs.DC
keywords LLM inference · energy-to-token production · token production function · data center power · PUE · joules per token · inference evaluation · utilization

The pith

LLM inference should be treated as energy-to-token production bounded by both compute and power ceilings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current evaluations of LLM inference, focused on accuracy, latency, and hardware utilization, are incomplete at deployment scale, where delivered power, cooling, PUE, and operational efficiency become the dominant limits. It proposes reframing inference as energy-to-token production and formalizes this with a dimensionally consistent Token Production Function in which token rate is bounded by ceilings on both compute-per-token and energy-per-token. Under fixed quality and service targets, techniques such as KV-cache compression, quantization, and sparse attention act as levers that reduce joules per token or utilization losses. The authors therefore advocate reporting Joules/token, the active binding constraint, PUE-adjusted power, and utilization-adjusted output in all future inference papers and benchmarks.

Core claim

LLM inference should be evaluated as energy-to-token production. The authors formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. At deployment scale the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. System optimizations are reframed as energy-to-token levers because they reduce FLOPs/token, joules/token, memory traffic, or utilization losses under fixed quality and service targets.
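The review reproduces no equations from the paper (the referee's minor comment below flags this), so the following is orientation only: one dimensionally consistent form that matches the stated description, in our notation rather than the authors'.

$R_{\mathrm{tok}} \le \min\left( \frac{u\,C_{\mathrm{eff}}}{c_{\mathrm{tok}}},\; \frac{P_{\mathrm{delivered}}/\mathrm{PUE}}{e_{\mathrm{tok}}} \right)$

Here $R_{\mathrm{tok}}$ is token rate (tokens/s), $C_{\mathrm{eff}}$ effective compute (FLOP/s), $u$ utilization in $[0,1]$, $c_{\mathrm{tok}}$ compute per token (FLOP/token), $P_{\mathrm{delivered}}$ delivered data-center power (W), $\mathrm{PUE} \ge 1$, and $e_{\mathrm{tok}}$ energy per token at the IT equipment (J/token). Both arguments of the $\min$ carry units of tokens per second, and the smaller one is the active binding constraint.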

What carries the argument

The Token Production Function, a dimensionally consistent model that bounds token production rate by ceilings on both compute per token and energy per token.
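A minimal executable sketch of that two-ceiling structure, assuming the same min-of-ceilings form as above; the parameter names are invented for illustration, since the paper supplies no code:

    # Hypothetical sketch, not the paper's model: token rate bounded by the
    # smaller of a compute ceiling and a power/energy ceiling.
    def token_rate_bound(eff_flops: float, utilization: float,
                         flops_per_token: float, delivered_power_w: float,
                         pue: float, joules_per_token: float) -> tuple[float, str]:
        """Return (upper bound on tokens/s, name of the active binding constraint)."""
        compute_ceiling = utilization * eff_flops / flops_per_token    # tokens/s
        energy_ceiling = (delivered_power_w / pue) / joules_per_token  # tokens/s
        if compute_ceiling <= energy_ceiling:
            return compute_ceiling, "compute"
        return energy_ceiling, "power/energy"

For example, 1e18 FLOP/s of effective compute at 40% utilization and 1e9 FLOP/token gives a compute ceiling of 4e8 tokens/s; with 20 MW delivered at PUE 1.3 and 0.05 J/token, the energy ceiling of about 3.1e8 tokens/s binds instead.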

If this is right

  • Optimizations such as KV-cache compression, quantization, and sparse attention reduce joules per token or utilization losses under fixed quality and service targets.
  • Inference papers and benchmarks must report Joules/token, the active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output (a measurement sketch follows this list).
  • API price dispersion across providers may reflect differences in underlying energy costs, though this serves only as directional motivation.
  • The binding constraint shifts from theoretical peak compute to delivered power and operational efficiency when scale increases.
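To make the reporting bullet concrete, a sketch of how a benchmark harness might derive those quantities from measured values. The field names, the facility-level normalization, and the decision to scale IT-side energy by PUE are our assumptions; the paper names the metrics but prescribes no schema.

    from dataclasses import dataclass

    @dataclass
    class MeasuredRun:
        tokens_out: int        # quality-conditioned tokens produced
        it_energy_j: float     # energy measured at the IT equipment, joules
        pue: float             # facility PUE during the run
        wall_time_s: float     # run duration, seconds
        achieved_flops: float  # sustained FLOP/s during the run
        peak_flops: float      # theoretical peak FLOP/s of the hardware

    def report(run: MeasuredRun) -> dict:
        # Facility-level energy folds cooling and overhead in via PUE.
        facility_energy_j = run.pue * run.it_energy_j
        return {
            "joules_per_token": facility_energy_j / run.tokens_out,
            "pue_adjusted_power_w": facility_energy_j / run.wall_time_s,
            # The review gives no formula for "utilization-adjusted token
            # output", so throughput and utilization are reported separately.
            "tokens_per_s": run.tokens_out / run.wall_time_s,
            "utilization": run.achieved_flops / run.peak_flops,
        }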

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware roadmaps could shift priority toward maximizing sustained power delivery efficiency rather than peak FLOPS.
  • Model architectures might be co-designed specifically to minimize energy per token at inference time rather than training FLOPs alone.
  • Data-center planning could incorporate renewable-energy matching directly into inference workload scheduling.
  • Evaluation suites could add simulated power and cooling caps to test whether claimed throughputs hold under realistic constraints (a toy cap model follows this list).
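As one illustration of the last bullet, a toy cap model, entirely ours: it assumes power draw scales linearly with token rate, which real serving stacks only approximate because of idle power and batching effects.

    def capped_tokens_per_s(claimed_tokens_per_s: float, it_power_w: float,
                            pue: float, facility_cap_w: float) -> float:
        """Scale a claimed throughput down to fit under a facility power cap."""
        facility_draw_w = it_power_w * pue  # cooling/overhead folded in via PUE
        if facility_draw_w <= facility_cap_w:
            return claimed_tokens_per_s
        return claimed_tokens_per_s * facility_cap_w / facility_draw_w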

Load-bearing premise

That at deployment scale the binding constraint on LLM inference moves from theoretical peak compute toward delivered power, cooling capacity, PUE, and operational efficiency.

What would settle it

An observation that even in large-scale deployments the primary limit on high-quality token output remains available FLOPS rather than delivered power or energy.

Figures

Figures reproduced from arXiv: 2605.11733 by Bo Li, Kaiyong Zhao, Peijie Dong, Qiang Wang, Shimiao Yuan, Xiang Liu, Xiaowen Chu, Zhenheng Tang.

Figure 1. The thermodynamics of token generation, illustrating how the Token Production Function …
Figure 2. Left axis (blue): global data center electricity (TWh/yr), 2020–2030 (IEA measured, central, and high scenarios with projection band) [1, 29]. Right axis (green, log scale): illustrative Φ_system proxy normalized to 2020 = 1, with step changes anchored to documented system-level deployments (Epoch 2: FlashAttention, vLLM/PagedAttention, INT4/AWQ; Epoch 3: MLA, NSA, sparse-hybrid). Energy grows roughly linearly whil…
Figure 3. Architectural efficiency comparison across optimization strategies. Bars summarize reported …
Figure 4. Divergent trajectories of AI ecosystem archetypes. Path A (infrastructure-constrained) …
original abstract

LLM inference is still evaluated mainly as a model or software problem: accuracy, latency, throughput, and hardware utilization. This is incomplete. At deployment scale, the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. We argue that the ML community should treat inference as \emph{energy-to-token production}. We formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. Listed API prices vary by over an order of magnitude across providers, but we use price dispersion only as directional motivation, not as causal evidence of marginal cost. The core physical question is instead: under fixed quality and service targets, when does the binding constraint move from theoretical peak compute toward delivered power, cooling, and operational efficiency? Under this framing, system optimizations -- latent KV-cache compression, sparse or heavily compressed attention, quantization, routing, and difficulty-adaptive reasoning -- are not merely local engineering tricks. They are energy-to-token levers because they reduce FLOPs/token, joules/token, memory traffic, or utilization losses under fixed $(q^{*},s^{*})$. We therefore call for inference papers and benchmarks to report Joules/token, active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output alongside accuracy and latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript argues that LLM inference should be evaluated as energy-to-token production rather than primarily through model accuracy, latency, and hardware utilization metrics. It proposes a dimensionally consistent Token Production Function that bounds token production rates by both compute and energy constraints. The authors emphasize that at large deployment scales, factors such as delivered power, cooling capacity, PUE, and utilization become critical, and advocate for new reporting standards in inference papers and benchmarks, including Joules/token and active binding constraints.

Significance. If the central premise holds—that energy and power constraints become the binding limits on LLM inference at deployment scale—this position could substantially influence the ML community by redirecting optimization efforts and benchmarking practices toward energy efficiency and sustainable computing. It highlights the need to consider operational data center realities in model evaluation, potentially leading to more holistic system designs.

major comments (1)
  1. The premise that the binding constraint on token production shifts from theoretical peak compute to delivered power, cooling, PUE, and operational efficiency at deployment scale lacks supporting measurements, utilization data, power traces, or scaling analysis from real deployments. This assumption is load-bearing for the justification of treating inference as energy-to-token production and reorienting benchmarks accordingly.
minor comments (1)
  1. The Token Production Function is described as dimensionally consistent, but no explicit mathematical formulation, equation, or derivation is provided, making it difficult to assess its novelty or applicability.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive review and the opportunity to respond. We address the single major comment below.

point-by-point responses
  1. Referee: The premise that the binding constraint on token production shifts from theoretical peak compute to delivered power, cooling, PUE, and operational efficiency at deployment scale lacks supporting measurements, utilization data, power traces, or scaling analysis from real deployments. This assumption is load-bearing for the justification of treating inference as energy-to-token production and reorienting benchmarks accordingly.

    Authors: We agree that the manuscript does not contain new empirical measurements, power traces, or utilization data from production deployments. As a position paper, the central contribution is the dimensionally consistent Token Production Function together with the recommendation to report Joules/token and active binding constraints; the premise that power and cooling become binding at scale is presented as a physically motivated hypothesis rather than a claim proven by new data. The text already qualifies API price dispersion as directional motivation only and does not assert causal evidence of marginal cost. We will make a partial revision by inserting a short paragraph that (a) explicitly labels the constraint-shift claim as a hypothesis grounded in known data-center characteristics (PUE > 1, finite rack power delivery, cooling limits) and (b) calls for future empirical studies to measure when and at what scale the binding constraint moves. This addition will clarify the evidential scope without altering the proposed evaluation framework. revision: partial

standing simulated objections not resolved
  • We cannot supply proprietary power traces, utilization statistics, or scaling measurements from commercial LLM deployments, as such operational data are not publicly available and we have no access to them.

Circularity Check

0 steps flagged

Conceptual position paper with no derivation chain or self-referential reductions

full rationale

The manuscript is a position paper that advocates reframing LLM inference evaluation around energy-to-token production and introduces a Token Production Function as a dimensional formalization. No equations, parameter fits, or step-by-step derivations appear in the provided text. The authors explicitly disclaim price dispersion as causal evidence and pose the binding-constraint question as open rather than solved by internal construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The argument therefore contains no steps that reduce by construction to their own inputs; it remains a self-contained conceptual proposal whose validity rests on external physical and operational data rather than circular redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that energy and power constraints dominate at deployment scale; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: At deployment scale, the binding constraint on LLM inference shifts from peak compute to delivered power, cooling, and operational efficiency.
    This premise is invoked to justify why energy-to-token becomes the primary evaluation lens.

pith-pipeline@v0.9.0 · 5573 in / 1174 out tokens · 54020 ms · 2026-05-13T05:13:02.068428+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 13 internal anchors

  1. [1] International Energy Agency. Energy and AI. Technical report, IEA Special Report, 2025. URL https://www.iea.org/reports/energy-and-ai
  2. [2] Electric Power Research Institute. Analyzing artificial intelligence and data center energy consumption. Technical Report 3002028905 (EPRI White Paper), EPRI, 2024. URL https://www.epri.com/research/products/3002028905
  3. [3] NVIDIA. AI factories. NVIDIA solutions page, 2026. URL https://www.nvidia.com/en-us/solutions/ai-factories/
  4. [4] NVIDIA. AI inference. NVIDIA solutions page, 2026. URL https://www.nvidia.com/en-us/solutions/ai/inference/
  5. [5] OpenAI. API pricing. OpenAI documentation, 2026. URL https://openai.com/api/pricing/
  6. [6] Anthropic. Models overview and API pricing. Anthropic documentation, 2026. URL https://docs.anthropic.com/en/docs/models-overview
  7. [7] DeepSeek. Models and pricing. DeepSeek API documentation, 2026. URL https://api-docs.deepseek.com/quick_start/pricing
  8. [8] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020. doi: 10.1145/3381831. URL https://doi.org/10.1145/3381831
  9. [9] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L. M. Munguia, D. Rothchild, J. Dean, et al. Carbon emissions and large neural network training, 2021. URL https://arxiv.org/abs/2104.10350
  10. [10] arXiv preprint arXiv:2104.10350
  11. [11] D. Patterson, J. Gonzalez, U. Hölzle, Q. Le, C. Liang, L.-M. Munguia, J. Dean, et al. The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022. doi: 10.1109/MC.2022.3148714. URL https://doi.org/10.1109/MC.2022.3148714
  12. [12] C. J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, K. Hazelwood, et al. Sustainable AI: Environmental implications, challenges and opportunities. In Proceedings of Machine Learning and Systems (MLSys), volume 4, pages 795–813, 2022. URL https://arxiv.org/abs/2111.00364
  13. [13] A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres. Quantifying the carbon emissions of machine learning, 2019. URL https://arxiv.org/abs/1910.09700. arXiv preprint arXiv:1910.09700
  14. [14] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3645–3650, 2019. URL https://arxiv.org/abs/1906.02243
  15. [15] A. S. Luccioni, Y. Jernite, and E. Strubell. Power hungry processing: Watts driving the cost of AI deployment? In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2024. doi: 10.1145/3630106.3658542. URL https://doi.org/10.1145/3630106.3658542
  16. [16] MLCommons. MLPerf inference v4.1 power results. Technical report, MLCommons, 2024. URL https://mlcommons.org/benchmarks/inference-datacenter/
  17. [17] W. W. Leontief. The Structure of American Economy, 1919–1929: An Empirical Application of Equilibrium Analysis. Harvard University Press, 1941.
  18. [18] K. J. Arrow, H. B. Chenery, B. S. Minhas, and R. M. Solow. Capital-labor substitution and economic efficiency. The Review of Economics and Statistics, 43(3):225–250, 1961. doi: 10.2307/1927286. URL https://doi.org/10.2307/1927286
  19. [19] Uptime Institute. 2024 global data center survey results. Technical report, Uptime Institute.
  20. [20] Global average PUE: 1.56; industry leaders: 1.08–1.09.
  21. [21] S. Samsi, D. Zhao, J. McDonald, B. Li, A. Michaleas, M. Jones, J. Kepner, et al. From words to watts: Benchmarking the energy costs of large language model inference, 2023. URL https://arxiv.org/abs/2310.03003. arXiv preprint arXiv:2310.03003
  22. [22] C. Niu, W. Zhang, J. Li, Y. Zhao, T. Wang, X. Wang, Y. Chen, et al. TokenPowerBench: Benchmarking the power consumption of LLM inference, 2025. URL https://arxiv.org/abs/2512.03024. arXiv preprint arXiv:2512.03024
  23. [23] R. M. Solow. Technical change and the aggregate production function. Review of Economics and Statistics, 39(3):312–320, 1957. doi: 10.2307/1926047. URL https://doi.org/10.2307/1926047
  24. [24] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009. doi: 10.1145/1498765.1498785. URL https://doi.org/10.1145/1498765.1498785
  25. [25] J. Sevilla and E. Roldán. Training compute of frontier AI models grows by 4-5x per year. Epoch AI Blog, 2024. URL https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
  26. [26] J.-W. Chung, R. Wu, J. J. Ma, and M. Chowdhury. Where do the joules go? Diagnosing inference energy consumption, 2026. URL https://arxiv.org/abs/2601.22076. arXiv preprint arXiv:2601.22076
  27. [27] J. Delavande, R. Pierrard, and S. Luccioni. Understanding efficiency: Quantization, batching, and serving strategies in LLM energy use, 2026. URL https://arxiv.org/abs/2601.22362. arXiv preprint arXiv:2601.22362
  28. [28] H. P. Cavagna, A. Proia, G. Madella, G. B. Esposito, F. Antici, D. Cesarini, Z. Kiziltan, and A. Bartolini. SweetSpot: An analytical model for predicting energy efficiency of LLM inference.
  29. [29] URL https://arxiv.org/abs/2602.05695. arXiv preprint arXiv:2602.05695
  30. [30] NVIDIA. NVIDIA H100 Tensor Core GPU: Product specifications. NVIDIA product page, 2026. URL https://www.nvidia.com/en-us/data-center/h100/
  31. [31] NVIDIA. NVIDIA HGX platform specifications (HGX H100 4/8-GPU). NVIDIA product page, 2026. URL https://www.nvidia.com/en-us/data-center/hgx
  32. [32] International Energy Agency. AI is set to drive surging electricity demand from data centres while offering the potential to transform how the energy sector works. IEA News, 2025. URL https://www.iea.org/news/ai-is-set-to-drive-surging-electricity-demand-from-data-centres-while-offering-the-potential-to-transform-how-the-energy-sector-works
  33. [33] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. Technical report, DeepSeek-AI, 2024. URL https://arxiv.org/abs/2405.04434. arXiv:2405.04434
  34. [34] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, D. Amodei, et al. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361. arXiv preprint arXiv:2001.08361
  35. [35] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, L. Sifre, et al. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556. arXiv preprint arXiv:2203.15556
  36. [36] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, G. Irving, et al. Scaling language models: Methods, analysis and insights from training Gopher, 2021. URL https://arxiv.org/abs/2112.11446. arXiv preprint arXiv:2112.11446
  37. [37] J. Sevilla, L. Heim, A. Ho, T. Besiroglu, M. Hobbhahn, and P. Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2022. URL https://arxiv.org/abs/2202.05924
  38. [38] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022. URL https://arxiv.org/abs/2205.14135
  39. [39] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, I. Stoica, et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023. URL https://arxiv.org/abs/2309.06180
  40. [40] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In Proceedings of the 11th International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.17323
  41. [41] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, volume 6, 2024. URL https://arxiv.org/abs/2306.00978
  42. [43] URL https://arxiv.org/abs/2311.03687
  43. [44] Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. LongGenBench: Long-context generation benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 865–883. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-emnlp.48. URL https://doi.org/10.18653/v1/2024.findings-emnlp.48
  44. [45] U.S. Department of Energy. DOE releases new report evaluating increase in electricity demand from data centers. Technical report, U.S. Department of Energy, 2024. URL https://www.energy.gov/articles/doe-releases-new-report-evaluating-increase-electricity-demand-data-centers. DOE News
  45. [46] I. Juniewicz. Hyperscaler capex has quadrupled since GPT-4's release. Epoch AI Data Insights.
  46. [47] URL https://epoch.ai/data-insights/hyperscaler-capex-trend/. Combined Alphabet, Amazon, Meta, Microsoft, and Oracle capex extracted from SEC EDGAR 10-Q/10-K filings.
  47. [48] L. Liu. Speech at the China Development Forum 2026: Token ("ciyuan") as the value anchor of the intelligent era; daily token-call volume in China exceeds 140 trillion as of March 2026. National Data Administration of China, 2026. URL https://www.nda.gov.cn/sjj/swdt/mtsy/0325/20260325113132934906079_pc.html
  48. [49] TechNode. Doubao surpasses 120 trillion daily tokens as usage doubles in three months. TechNode, 2026. URL https://technode.com/2026/04/07/doubao-surpasses-120-trillion-daily-tokens-as-usage-doubles-in-three-months/
  49. [50] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995. doi: 10.1145/216585.216588. URL https://doi.org/10.1145/216585.216588
  50. [51] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. X. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, W. Zeng, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of ACL 2025, 2025. URL https://arxiv.org/abs/2502.11089. Best Paper; arXiv:2502.11089
  51. [52] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Technical report, DeepSeek-AI, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
  52. [53] X. Liu, Z. Tang, P. Dong, Z. Li, Y. Liu, B. Li, X. Hu, and X. Chu. ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference. In Advances in Neural Information Processing Systems (NeurIPS) 39, 2025. URL https://arxiv.org/abs/2502.00299. arXiv:2502.00299
  53. [54] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, B. Chen, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.14048
  54. [55] Xiang Liu, Hong Chen, Xuming Hu, and Xiaowen Chu. FlowKV: Enhancing multi-turn conversational coherence in LLMs via isolated key-value cache management. In NeurIPS Workshop on Multi-Turn Interactions in Large Language Models, 2025. URL https://arxiv.org/abs/2505.15347
  55. [56] Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, and Xiaowen Chu. Semantic integrity matters: Benchmarking and preserving high-density reasoning in KV cache compression. In International Conference on Machine Learning (ICML).
  56. [57] URL https://arxiv.org/abs/2502.01941
  57. [58] Hong Chen, Xiang Liu, Bo Wang, Yuxuan Fan, Yuanlin Chu, Zongluo Li, Xiaowen Chu, and Xuming Hu. SONIC: Segmented optimized nexus for information compression in key-value caching. arXiv preprint arXiv:2601.21927, 2026. URL https://arxiv.org/abs/2601.21927
  58. [59] Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, and Xiaowen Chu. AnTKV: Anchor token-aware sub-bit vector quantization for KV cache in large language models. arXiv preprint arXiv:2506.19505, 2025. URL https://arxiv.org/abs/2506.19505
  59. [60] Yuanbing Zhu, Zhenheng Tang, Xiang Liu, Ang Li, Bo Li, Xiaowen Chu, and Bo Han. OracleKV: Oracle guidance for question-independent KV cache eviction. In ICML Workshop on Long-Context Foundation Models, 2025. URL https://openreview.net/pdf?id=KHM2YOGgX9
  60. [61] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, I. Stoica, et al. FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 31094–31116, 2023. URL https://arxiv.org/abs/2303.06865
  61. [62] MiniMax, A. Li, B. Gong, B. Yang, B. Shan, et al. MiniMax-01: Scaling foundation models with lightning attention, 2025. URL https://arxiv.org/abs/2501.08313. arXiv preprint arXiv:2501.08313
  62. [63] X. Liu, X. Hu, X. Chu, and E. Choi. DiffAdapt: Difficulty-adaptive reasoning for token-efficient LLM inference, 2025. URL https://arxiv.org/abs/2510.19669. arXiv preprint arXiv:2510.19669
  63. [64] Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, and Xiaowen Chu. Reasoning language model inference serving unveiled: An empirical study. In International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2510.18672
  64. [65] Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, and Bo Li. Can compressed LLMs truly act? An empirical evaluation of agentic capabilities in LLM compression. In International Conference on Machine Learning (ICML), 2025. URL https://arxiv.org/abs/2505.19433
  65. [66] Zhenheng Tang, Xiang Liu, Qian Wang, Peijie Dong, Bingsheng He, Xiaowen Chu, and Bo Li. The lottery LLM hypothesis: Rethinking what abilities should LLM compression preserve? In ICLR Blogposts Track, 2025. URL https://arxiv.org/abs/2502.17535
  66. [67] National Energy Administration. National Energy Administration releases 2024 national electric power industry statistics. National Energy Administration of China, 2025. URL https://www.nea.gov.cn/20250121/097bfd7c1cd3498897639857d86d5dac/c.html
  67. [68] Ministry of Industry and Information Technology, State Administration for Market Regulation, and National Energy Administration. Interpretation of the work plan for stabilizing growth in the power equipment industry (2025–2026). Technical report, Ministry of Industry and Information Technology, 2025. URL https://www.miit.gov.cn/zwgk/zcjd/art/2025/art_44b...
  68. [69] Infocomm Media Development Authority and Singapore Economic Development Board. Launch of second data centre – call for application. Technical report, IMDA / EDB, 2025. URL https://www.imda.gov.sg/resources/press-releases-factsheets-and-speeches/factsheets/2025/launch-of-second-data-centre. IMDA / EDB Factsheet
  69. [70] OpenRouter. State of AI: token-usage rankings, Q1 2026. OpenRouter, 2026. URL https://openrouter.ai/state-of-ai
  70. [71] Stanford Institute for Human-Centered Artificial Intelligence. 2025 AI Index report: AI model performance gaps narrowing, compute costs plummeting. Technical report, Stanford HAI, 2025. URL https://aiindex.stanford.edu/report/
  71. [72] N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso. The computational limits of deep learning, 2020. URL https://arxiv.org/abs/2007.05558. arXiv preprint arXiv:2007.05558
  72. [73] W. S. Jevons. The Coal Question: An Inquiry Concerning the Progress of the Nation, and the Probable Exhaustion of Our Coal-Mines. Macmillan and Co., London, 1865.
  73. [74] S. Sorrell. Jevons' paradox revisited: The evidence for backfire from improved energy efficiency. Energy Policy, 37(4):1456–1469, 2009. doi: 10.1016/j.enpol.2008.12.003. URL https://doi.org/10.1016/j.enpol.2008.12.003
  74. [75] G. Appenzeller. Welcome to LLMflation: LLM inference cost is going down fast. Andreessen Horowitz, 2024. URL https://a16z.com/llmflation-llm-inference-cost/
  75. [76] M. Demirer, A. Fradkin, N. Tadelis, and S. Peng. The emerging market for intelligence: pricing, supply, and demand for LLMs. Technical Report 34608, National Bureau of Economic Research.
  76. [77] URL https://www.nber.org/papers/w34608. NBER Working Paper No. 34608
  77. [78] BusinessEurope. High cost of energy: industrial electricity prices in the EU vs the US and China. BusinessEurope Data Hub, 2024. URL https://www.businesseurope.eu/media-room/data-hub/high-cost-of-energy/
  78. [79] C. Shapiro and H. R. Varian. Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press, 1999.
  79. [80] U.S. Energy Information Administration. Electric Power Monthly, Table 5.6.B: Average price of electricity to ultimate customers by end-use sector, by state (December 2025 YTD). Technical report, U.S. Energy Information Administration, 2026. URL https://www.eia.gov/electricity/monthly/epm_table_grapher.php?t=epmt_5_6_b
  80. [81] E. Wu. Sovereignty and data localization. Technical report, Belfer Center for Science and International Affairs, Harvard Kennedy School, 2021. URL https://www.belfercenter.org/publication/sovereignty-and-data-localization
Showing first 80 references.