pith. machine review for the scientific record.

arxiv: 2605.11733 · v1 · submitted 2026-05-12 · 💻 cs.CE · cs.DC

Recognition: no theorem link

Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:13 UTC · model grok-4.3

classification 💻 cs.CE cs.DC
keywords LLM inference · energy-to-token production · token production function · data center power · PUE · joules per token · inference evaluation · utilization

The pith

LLM inference should be treated as energy-to-token production bounded by both compute and power ceilings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current evaluations of LLM inference, focused on accuracy, latency, and hardware utilization, are incomplete at deployment scale, where delivered power, cooling, PUE, and operational efficiency become the dominant limits. It proposes reframing inference as energy-to-token production and formalizes this with a dimensionally consistent Token Production Function in which token rate is bounded by ceilings on both compute-per-token and energy-per-token. Under fixed quality and service targets, techniques such as KV-cache compression, quantization, and sparse attention act as levers that reduce joules per token or utilization losses. The authors therefore advocate reporting Joules/token, the active binding constraint, PUE-adjusted power, and utilization-adjusted output in all future inference papers and benchmarks.

Core claim

LLM inference should be evaluated as energy-to-token production. The authors formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. At deployment scale the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. System optimizations are reframed as energy-to-token levers because they reduce FLOPs/token, joules/token, memory traffic, or utilization losses under fixed quality and service targets.
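The review reproduces no equations from the paper (the referee's minor comment below flags this), so the following is orientation only: one dimensionally consistent form that matches the stated description, in our notation rather than the authors'.

$R_{\mathrm{tok}} \le \min\left( \frac{u\,C_{\mathrm{eff}}}{c_{\mathrm{tok}}},\; \frac{P_{\mathrm{delivered}}/\mathrm{PUE}}{e_{\mathrm{tok}}} \right)$

Here $R_{\mathrm{tok}}$ is token rate (tokens/s), $C_{\mathrm{eff}}$ effective compute (FLOP/s), $u$ utilization in $[0,1]$, $c_{\mathrm{tok}}$ compute per token (FLOP/token), $P_{\mathrm{delivered}}$ delivered data-center power (W), $\mathrm{PUE} \ge 1$, and $e_{\mathrm{tok}}$ energy per token at the IT equipment (J/token). Both arguments of the $\min$ carry units of tokens per second, and the smaller one is the active binding constraint.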

What carries the argument

The Token Production Function, a dimensionally consistent model that bounds token production rate by ceilings on both compute per token and energy per token.
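A minimal executable sketch of that two-ceiling structure, assuming the same min-of-ceilings form as above; the parameter names are invented for illustration, since the paper supplies no code:

    # Hypothetical sketch, not the paper's model: token rate bounded by the
    # smaller of a compute ceiling and a power/energy ceiling.
    def token_rate_bound(eff_flops: float, utilization: float,
                         flops_per_token: float, delivered_power_w: float,
                         pue: float, joules_per_token: float) -> tuple[float, str]:
        """Return (upper bound on tokens/s, name of the active binding constraint)."""
        compute_ceiling = utilization * eff_flops / flops_per_token    # tokens/s
        energy_ceiling = (delivered_power_w / pue) / joules_per_token  # tokens/s
        if compute_ceiling <= energy_ceiling:
            return compute_ceiling, "compute"
        return energy_ceiling, "power/energy"

For example, 1e18 FLOP/s of effective compute at 40% utilization and 1e9 FLOP/token gives a compute ceiling of 4e8 tokens/s; with 20 MW delivered at PUE 1.3 and 0.05 J/token, the energy ceiling of about 3.1e8 tokens/s binds instead.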

If this is right

  • Optimizations such as KV-cache compression, quantization, and sparse attention reduce joules per token or utilization losses under fixed quality and service targets.
  • Inference papers and benchmarks must report Joules/token, the active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output (a measurement sketch follows this list).
  • API price dispersion across providers may reflect differences in underlying energy costs, though this serves only as directional motivation.
  • The binding constraint shifts from theoretical peak compute to delivered power and operational efficiency when scale increases.
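To make the reporting bullet concrete, a sketch of how a benchmark harness might derive those quantities from measured values. The field names, the facility-level normalization, and the decision to scale IT-side energy by PUE are our assumptions; the paper names the metrics but prescribes no schema.

    from dataclasses import dataclass

    @dataclass
    class MeasuredRun:
        tokens_out: int        # quality-conditioned tokens produced
        it_energy_j: float     # energy measured at the IT equipment, joules
        pue: float             # facility PUE during the run
        wall_time_s: float     # run duration, seconds
        achieved_flops: float  # sustained FLOP/s during the run
        peak_flops: float      # theoretical peak FLOP/s of the hardware

    def report(run: MeasuredRun) -> dict:
        # Facility-level energy folds cooling and overhead in via PUE.
        facility_energy_j = run.pue * run.it_energy_j
        return {
            "joules_per_token": facility_energy_j / run.tokens_out,
            "pue_adjusted_power_w": facility_energy_j / run.wall_time_s,
            # The review gives no formula for "utilization-adjusted token
            # output", so throughput and utilization are reported separately.
            "tokens_per_s": run.tokens_out / run.wall_time_s,
            "utilization": run.achieved_flops / run.peak_flops,
        }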

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware roadmaps could shift priority toward maximizing sustained power delivery efficiency rather than peak FLOPS.
  • Model architectures might be co-designed specifically to minimize energy per token at inference time rather than training FLOPs alone.
  • Data-center planning could incorporate renewable-energy matching directly into inference workload scheduling.
  • Evaluation suites could add simulated power and cooling caps to test whether claimed throughputs hold under realistic constraints (a toy cap model follows this list).
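As one illustration of the last bullet, a toy cap model, entirely ours: it assumes power draw scales linearly with token rate, which real serving stacks only approximate because of idle power and batching effects.

    def capped_tokens_per_s(claimed_tokens_per_s: float, it_power_w: float,
                            pue: float, facility_cap_w: float) -> float:
        """Scale a claimed throughput down to fit under a facility power cap."""
        facility_draw_w = it_power_w * pue  # cooling/overhead folded in via PUE
        if facility_draw_w <= facility_cap_w:
            return claimed_tokens_per_s
        return claimed_tokens_per_s * facility_cap_w / facility_draw_w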

Load-bearing premise

That at deployment scale the binding constraint on LLM inference moves from theoretical peak compute toward delivered power, cooling capacity, PUE, and operational efficiency.

What would settle it

An observation that even in large-scale deployments the primary limit on high-quality token output remains available FLOPS rather than delivered power or energy.

Figures

Figures reproduced from arXiv: 2605.11733 by Bo Li, Kaiyong Zhao, Peijie Dong, Qiang Wang, Shimiao Yuan, Xiang Liu, Xiaowen Chu, Zhenheng Tang.

Figure 1. The thermodynamics of token generation, illustrating how the Token Production Function …
Figure 2. Left axis (blue): global data center electricity (TWh/yr), 2020–2030 (IEA measured, central, and high scenarios with projection band) [1, 29]. Right axis (green, log scale): illustrative Φ_system proxy normalized to 2020 = 1, with step changes anchored to documented system-level deployments (Epoch 2: FlashAttention, vLLM/PagedAttention, INT4/AWQ; Epoch 3: MLA, NSA, sparse-hybrid). Energy grows roughly linearly whil…
Figure 3. Architectural efficiency comparison across optimization strategies. Bars summarize reported …
Figure 4. Divergent trajectories of AI ecosystem archetypes. Path A (infrastructure-constrained) …
original abstract

LLM inference is still evaluated mainly as a model or software problem: accuracy, latency, throughput, and hardware utilization. This is incomplete. At deployment scale, the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. We argue that the ML community should treat inference as \emph{energy-to-token production}. We formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. Listed API prices vary by over an order of magnitude across providers, but we use price dispersion only as directional motivation, not as causal evidence of marginal cost. The core physical question is instead: under fixed quality and service targets, when does the binding constraint move from theoretical peak compute toward delivered power, cooling, and operational efficiency? Under this framing, system optimizations -- latent KV-cache compression, sparse or heavily compressed attention, quantization, routing, and difficulty-adaptive reasoning -- are not merely local engineering tricks. They are energy-to-token levers because they reduce FLOPs/token, joules/token, memory traffic, or utilization losses under fixed $(q^{*},s^{*})$. We therefore call for inference papers and benchmarks to report Joules/token, active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output alongside accuracy and latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript argues that LLM inference should be evaluated as energy-to-token production rather than primarily through model accuracy, latency, and hardware utilization metrics. It proposes a dimensionally consistent Token Production Function that bounds token production rates by both compute and energy constraints. The authors emphasize that at large deployment scales, factors such as delivered power, cooling capacity, PUE, and utilization become critical, and advocate for new reporting standards in inference papers and benchmarks, including Joules/token and active binding constraints.

Significance. If the central premise holds—that energy and power constraints become the binding limits on LLM inference at deployment scale—this position could substantially influence the ML community by redirecting optimization efforts and benchmarking practices toward energy efficiency and sustainable computing. It highlights the need to consider operational data center realities in model evaluation, potentially leading to more holistic system designs.

major comments (1)
  1. The premise that the binding constraint on token production shifts from theoretical peak compute to delivered power, cooling, PUE, and operational efficiency at deployment scale lacks supporting measurements, utilization data, power traces, or scaling analysis from real deployments. This assumption is load-bearing for the justification of treating inference as energy-to-token production and reorienting benchmarks accordingly.
minor comments (1)
  1. The Token Production Function is described as dimensionally consistent, but no explicit mathematical formulation, equation, or derivation is provided, making it difficult to assess its novelty or applicability.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive review and the opportunity to respond. We address the single major comment below.

point-by-point responses
  1. Referee: The premise that the binding constraint on token production shifts from theoretical peak compute to delivered power, cooling, PUE, and operational efficiency at deployment scale lacks supporting measurements, utilization data, power traces, or scaling analysis from real deployments. This assumption is load-bearing for the justification of treating inference as energy-to-token production and reorienting benchmarks accordingly.

    Authors: We agree that the manuscript does not contain new empirical measurements, power traces, or utilization data from production deployments. As a position paper, the central contribution is the dimensionally consistent Token Production Function together with the recommendation to report Joules/token and active binding constraints; the premise that power and cooling become binding at scale is presented as a physically motivated hypothesis rather than a claim proven by new data. The text already qualifies API price dispersion as directional motivation only and does not assert causal evidence of marginal cost. We will make a partial revision by inserting a short paragraph that (a) explicitly labels the constraint-shift claim as a hypothesis grounded in known data-center characteristics (PUE > 1, finite rack power delivery, cooling limits) and (b) calls for future empirical studies to measure when and at what scale the binding constraint moves. This addition will clarify the evidential scope without altering the proposed evaluation framework. revision: partial

standing simulated objections not resolved
  • We cannot supply proprietary power traces, utilization statistics, or scaling measurements from commercial LLM deployments, as such operational data are not publicly available and we have no access to them.

Circularity Check

0 steps flagged

Conceptual position paper with no derivation chain or self-referential reductions

full rationale

The manuscript is a position paper that advocates reframing LLM inference evaluation around energy-to-token production and introduces a Token Production Function as a dimensional formalization. No equations, parameter fits, or step-by-step derivations appear in the provided text. The authors explicitly disclaim price dispersion as causal evidence and pose the binding-constraint question as open rather than solved by internal construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The argument therefore contains no steps that reduce by construction to their own inputs; it remains a self-contained conceptual proposal whose validity rests on external physical and operational data rather than circular redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that energy and power constraints dominate at deployment scale; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: At deployment scale, the binding constraint on LLM inference shifts from peak compute to delivered power, cooling, and operational efficiency.
    This premise is invoked to justify why energy-to-token becomes the primary evaluation lens.

pith-pipeline@v0.9.0 · 5573 in / 1174 out tokens · 54020 ms · 2026-05-13T05:13:02.068428+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 13 internal anchors

  1. [1] International Energy Agency. Energy and AI. Technical report, IEA Special Report, 2025. URL https://www.iea.org/reports/energy-and-ai
  2. [2] Electric Power Research Institute. Analyzing artificial intelligence and data center energy consumption. Technical Report 3002028905 (EPRI White Paper), EPRI, 2024. URL https://www.epri.com/research/products/3002028905
  3. [3] NVIDIA. AI factories. NVIDIA solutions page, 2026. URL https://www.nvidia.com/en-us/solutions/ai-factories/
  4. [4] NVIDIA. AI inference. NVIDIA solutions page, 2026. URL https://www.nvidia.com/en-us/solutions/ai/inference/
  5. [5] OpenAI. API pricing. OpenAI documentation, 2026. URL https://openai.com/api/pricing/
  6. [6] Anthropic. Models overview and API pricing. Anthropic documentation, 2026. URL https://docs.anthropic.com/en/docs/models-overview
  7. [7] DeepSeek. Models and pricing. DeepSeek API documentation, 2026. URL https://api-docs.deepseek.com/quick_start/pricing
  8. [8] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020. doi: 10.1145/3381831. URL https://doi.org/10.1145/3381831
  9. [9] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L. M. Munguia, D. Rothchild, J. Dean, et al. Carbon emissions and large neural network training, 2021. URL https://arxiv.org/abs/2104.10350
  10. [10] arXiv preprint arXiv:2104.10350
  11. [11] D. Patterson, J. Gonzalez, U. Hölzle, Q. Le, C. Liang, L.-M. Munguia, J. Dean, et al. The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022. doi: 10.1109/MC.2022.3148714. URL https://doi.org/10.1109/MC.2022.3148714
  12. [12] C. J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, K. Hazelwood, et al. Sustainable AI: Environmental implications, challenges and opportunities. In Proceedings of Machine Learning and Systems (MLSys), volume 4, pages 795–813, 2022. URL https://arxiv.org/abs/2111.00364
  13. [13] A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres. Quantifying the carbon emissions of machine learning, 2019. URL https://arxiv.org/abs/1910.09700. arXiv preprint arXiv:1910.09700
  14. [14] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3645–3650, 2019. URL https://arxiv.org/abs/1906.02243
  15. [15] A. S. Luccioni, Y. Jernite, and E. Strubell. Power hungry processing: Watts driving the cost of AI deployment? In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2024. doi: 10.1145/3630106.3658542. URL https://doi.org/10.1145/3630106.3658542
  16. [16] MLCommons. MLPerf inference v4.1 power results. Technical report, MLCommons, 2024. URL https://mlcommons.org/benchmarks/inference-datacenter/
  17. [17] W. W. Leontief. The Structure of American Economy, 1919–1929: An Empirical Application of Equilibrium Analysis. Harvard University Press, 1941.
  18. [18] K. J. Arrow, H. B. Chenery, B. S. Minhas, and R. M. Solow. Capital-labor substitution and economic efficiency. The Review of Economics and Statistics, 43(3):225–250, 1961. doi: 10.2307/1927286. URL https://doi.org/10.2307/1927286
  19. [19] Uptime Institute. 2024 global data center survey results. Technical report, Uptime Institute.
  20. [20] Global average PUE: 1.56; industry leaders: 1.08–1.09.
  21. [21] S. Samsi, D. Zhao, J. McDonald, B. Li, A. Michaleas, M. Jones, J. Kepner, et al. From words to watts: Benchmarking the energy costs of large language model inference, 2023. URL https://arxiv.org/abs/2310.03003. arXiv preprint arXiv:2310.03003
  22. [22] C. Niu, W. Zhang, J. Li, Y. Zhao, T. Wang, X. Wang, Y. Chen, et al. TokenPowerBench: Benchmarking the power consumption of LLM inference, 2025. URL https://arxiv.org/abs/2512.03024. arXiv preprint arXiv:2512.03024
  23. [23] R. M. Solow. Technical change and the aggregate production function. Review of Economics and Statistics, 39(3):312–320, 1957. doi: 10.2307/1926047. URL https://doi.org/10.2307/1926047
  24. [24] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009. doi: 10.1145/1498765.1498785. URL https://doi.org/10.1145/1498765.1498785
  25. [25] J. Sevilla and E. Roldán. Training compute of frontier AI models grows by 4-5x per year. Epoch AI Blog, 2024. URL https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
  26. [26] J.-W. Chung, R. Wu, J. J. Ma, and M. Chowdhury. Where do the joules go? Diagnosing inference energy consumption, 2026. URL https://arxiv.org/abs/2601.22076. arXiv preprint arXiv:2601.22076
  27. [27] J. Delavande, R. Pierrard, and S. Luccioni. Understanding efficiency: Quantization, batching, and serving strategies in LLM energy use, 2026. URL https://arxiv.org/abs/2601.22362. arXiv preprint arXiv:2601.22362
  28. [28] H. P. Cavagna, A. Proia, G. Madella, G. B. Esposito, F. Antici, D. Cesarini, Z. Kiziltan, and A. Bartolini. SweetSpot: An analytical model for predicting energy efficiency of LLM inference.
  29. [29] URL https://arxiv.org/abs/2602.05695. arXiv preprint arXiv:2602.05695
  30. [30] NVIDIA. NVIDIA H100 Tensor Core GPU: Product specifications. NVIDIA product page, 2026. URL https://www.nvidia.com/en-us/data-center/h100/
  31. [31] NVIDIA. NVIDIA HGX platform specifications (HGX H100 4/8-GPU). NVIDIA product page, 2026. URL https://www.nvidia.com/en-us/data-center/hgx
  32. [32] International Energy Agency. AI is set to drive surging electricity demand from data centres while offering the potential to transform how the energy sector works. IEA News, 2025. URL https://www.iea.org/news/ai-is-set-to-drive-surging-electricity-demand-from-data-centres-while-offering-the-potential-to-transform-how-the-energy-sector-works
  33. [33] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. Technical report, DeepSeek-AI, 2024. URL https://arxiv.org/abs/2405.04434. arXiv:2405.04434
  34. [34] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, D. Amodei, et al. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361. arXiv preprint arXiv:2001.08361
  35. [35] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, L. Sifre, et al. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556. arXiv preprint arXiv:2203.15556
  36. [36] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, G. Irving, et al. Scaling language models: Methods, analysis and insights from training Gopher, 2021. URL https://arxiv.org/abs/2112.11446. arXiv preprint arXiv:2112.11446
  37. [37] J. Sevilla, L. Heim, A. Ho, T. Besiroglu, M. Hobbhahn, and P. Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2022. URL https://arxiv.org/abs/2202.05924
  38. [38] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022. URL https://arxiv.org/abs/2205.14135
  39. [39] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, I. Stoica, et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023. URL https://arxiv.org/abs/2309.06180
  40. [40] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In Proceedings of the 11th International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.17323
  41. [41] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, volume 6, 2024. URL https://arxiv.org/abs/2306.00978
  42. [43] URL https://arxiv.org/abs/2311.03687
  43. [44] Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. LongGenBench: Long-context generation benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 865–883. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-emnlp.48. URL https://doi.org/10.18653/v1/2024.findings-emnlp.48
  44. [45] U.S. Department of Energy. DOE releases new report evaluating increase in electricity demand from data centers. Technical report, U.S. Department of Energy, 2024. URL https://www.energy.gov/articles/doe-releases-new-report-evaluating-increase-electricity-demand-data-centers. DOE News
  45. [46] I. Juniewicz. Hyperscaler capex has quadrupled since GPT-4's release. Epoch AI Data Insights.
  46. [47] URL https://epoch.ai/data-insights/hyperscaler-capex-trend/. Combined Alphabet, Amazon, Meta, Microsoft, and Oracle capex extracted from SEC EDGAR 10-Q/10-K filings.
  47. [48] L. Liu. Speech at the China Development Forum 2026: Token ("ciyuan") as the value anchor of the intelligent era; daily token-call volume in China exceeds 140 trillion as of March 2026. National Data Administration of China, 2026. URL https://www.nda.gov.cn/sjj/swdt/mtsy/0325/20260325113132934906079_pc.html
  48. [49] TechNode. Doubao surpasses 120 trillion daily tokens as usage doubles in three months. TechNode, 2026. URL https://technode.com/2026/04/07/doubao-surpasses-120-trillion-daily-tokens-as-usage-doubles-in-three-months/
  49. [50] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995. doi: 10.1145/216585.216588. URL https://doi.org/10.1145/216585.216588
  50. [51] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. X. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, W. Zeng, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of ACL 2025, 2025. URL https://arxiv.org/abs/2502.11089. Best Paper; arXiv:2502.11089
  51. [52] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Technical report, DeepSeek-AI, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
  52. [53] X. Liu, Z. Tang, P. Dong, Z. Li, Y. Liu, B. Li, X. Hu, and X. Chu. ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference. In Advances in Neural Information Processing Systems (NeurIPS) 39, 2025. URL https://arxiv.org/abs/2502.00299. arXiv:2502.00299
  53. [54] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, B. Chen, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.14048
  54. [55] Xiang Liu, Hong Chen, Xuming Hu, and Xiaowen Chu. FlowKV: Enhancing multi-turn conversational coherence in LLMs via isolated key-value cache management. In NeurIPS Workshop on Multi-Turn Interactions in Large Language Models, 2025. URL https://arxiv.org/abs/2505.15347
  55. [56] Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, and Xiaowen Chu. Semantic integrity matters: Benchmarking and preserving high-density reasoning in KV cache compression. In International Conference on Machine Learning (ICML).
  56. [57] URL https://arxiv.org/abs/2502.01941
  57. [58] Hong Chen, Xiang Liu, Bo Wang, Yuxuan Fan, Yuanlin Chu, Zongluo Li, Xiaowen Chu, and Xuming Hu. SONIC: Segmented optimized nexus for information compression in key-value caching. arXiv preprint arXiv:2601.21927, 2026. URL https://arxiv.org/abs/2601.21927
  58. [59] Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, and Xiaowen Chu. AnTKV: Anchor token-aware sub-bit vector quantization for KV cache in large language models. arXiv preprint arXiv:2506.19505, 2025. URL https://arxiv.org/abs/2506.19505
  59. [60] Yuanbing Zhu, Zhenheng Tang, Xiang Liu, Ang Li, Bo Li, Xiaowen Chu, and Bo Han. OracleKV: Oracle guidance for question-independent KV cache eviction. In ICML Workshop on Long-Context Foundation Models, 2025. URL https://openreview.net/pdf?id=KHM2YOGgX9
  60. [61] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, I. Stoica, et al. FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 31094–31116, 2023. URL https://arxiv.org/abs/2303.06865
  61. [62] MiniMax, A. Li, B. Gong, B. Yang, B. Shan, et al. MiniMax-01: Scaling foundation models with lightning attention, 2025. URL https://arxiv.org/abs/2501.08313. arXiv preprint arXiv:2501.08313
  62. [63] X. Liu, X. Hu, X. Chu, and E. Choi. DiffAdapt: Difficulty-adaptive reasoning for token-efficient LLM inference, 2025. URL https://arxiv.org/abs/2510.19669. arXiv preprint arXiv:2510.19669
  63. [64] Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, and Xiaowen Chu. Reasoning language model inference serving unveiled: An empirical study. In International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2510.18672
  64. [65] Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, and Bo Li. Can compressed LLMs truly act? An empirical evaluation of agentic capabilities in LLM compression. In International Conference on Machine Learning (ICML), 2025. URL https://arxiv.org/abs/2505.19433
  65. [66] Zhenheng Tang, Xiang Liu, Qian Wang, Peijie Dong, Bingsheng He, Xiaowen Chu, and Bo Li. The lottery LLM hypothesis: Rethinking what abilities should LLM compression preserve? In ICLR Blogposts Track, 2025. URL https://arxiv.org/abs/2502.17535
  66. [67] National Energy Administration. National Energy Administration releases 2024 national electric power industry statistics. National Energy Administration of China, 2025. URL https://www.nea.gov.cn/20250121/097bfd7c1cd3498897639857d86d5dac/c.html
  67. [68] Ministry of Industry and Information Technology, State Administration for Market Regulation, and National Energy Administration. Interpretation of the work plan for stabilizing growth in the power equipment industry (2025–2026). Technical report, Ministry of Industry and Information Technology, 2025. URL https://www.miit.gov.cn/zwgk/zcjd/art/2025/art_44b...
  68. [69] Infocomm Media Development Authority and Singapore Economic Development Board. Launch of second data centre – call for application. Technical report, IMDA / EDB, 2025. URL https://www.imda.gov.sg/resources/press-releases-factsheets-and-speeches/factsheets/2025/launch-of-second-data-centre. IMDA / EDB Factsheet
  69. [70] OpenRouter. State of AI: token-usage rankings, Q1 2026. OpenRouter, 2026. URL https://openrouter.ai/state-of-ai
  70. [71] Stanford Institute for Human-Centered Artificial Intelligence. 2025 AI Index report: AI model performance gaps narrowing, compute costs plummeting. Technical report, Stanford HAI, 2025. URL https://aiindex.stanford.edu/report/
  71. [72] N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso. The computational limits of deep learning, 2020. URL https://arxiv.org/abs/2007.05558. arXiv preprint arXiv:2007.05558
  72. [73] W. S. Jevons. The Coal Question: An Inquiry Concerning the Progress of the Nation, and the Probable Exhaustion of Our Coal-Mines. Macmillan and Co., London, 1865.
  73. [74] S. Sorrell. Jevons' paradox revisited: The evidence for backfire from improved energy efficiency. Energy Policy, 37(4):1456–1469, 2009. doi: 10.1016/j.enpol.2008.12.003. URL https://doi.org/10.1016/j.enpol.2008.12.003
  74. [75] G. Appenzeller. Welcome to LLMflation: LLM inference cost is going down fast. Andreessen Horowitz, 2024. URL https://a16z.com/llmflation-llm-inference-cost/
  75. [76] M. Demirer, A. Fradkin, N. Tadelis, and S. Peng. The emerging market for intelligence: pricing, supply, and demand for LLMs. Technical Report 34608, National Bureau of Economic Research.
  76. [77] URL https://www.nber.org/papers/w34608. NBER Working Paper No. 34608
  77. [78] BusinessEurope. High cost of energy: industrial electricity prices in the EU vs the US and China. BusinessEurope Data Hub, 2024. URL https://www.businesseurope.eu/media-room/data-hub/high-cost-of-energy/
  78. [79] C. Shapiro and H. R. Varian. Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press, 1999.
  79. [80] U.S. Energy Information Administration. Electric Power Monthly, Table 5.6.B: Average price of electricity to ultimate customers by end-use sector, by state (December 2025 YTD). Technical report, U.S. Energy Information Administration, 2026. URL https://www.eia.gov/electricity/monthly/epm_table_grapher.php?t=epmt_5_6_b
  80. [81] E. Wu. Sovereignty and data localization. Technical report, Belfer Center for Science and International Affairs, Harvard Kennedy School, 2021. URL https://www.belfercenter.org/publication/sovereignty-and-data-localization
Showing first 80 references.