pith. machine review for the scientific record.

arxiv: 2604.09611 · v1 · submitted 2026-03-12 · 💻 cs.DC · cs.AI

Recognition: no theorem link

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:25 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords LLM inference · multi-request workflows · energy efficiency · batch size · performance trade-offs · GPU power capping · vLLM · serving systems

The pith

Batch size is the strongest lever for trading off latency and energy in multi-request LLM workflows, but only when the workload shares large prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to map how performance and energy interact when LLMs handle sequences of dependent requests rather than isolated queries. It builds four concrete workloads that mirror sequential summarization, interactive copilots, multi-agent coding, and mixed patterns, then measures them on an A100 testbed running vLLM and Parrot. The central finding is that batch size produces the largest changes in both latency and energy, yet the same setting can cut energy sharply in one workload while leaving another almost unchanged. Power capping yields smaller, more predictable savings, while output length scales energy roughly linearly. The measurements also show that vLLM keeps GPU utilization high on decode-heavy tasks and Parrot saves extra energy when power is strictly limited.

Core claim

We present the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. Four workloads capture sequential, interactive, agentic, and composite patterns. On an NVIDIA A100 with vLLM and Parrot, batch size emerges as the dominant knob, yet its benefits are workload-dependent: it improves efficiency for tasks with large shared prompts, shows no gain for sequential summarization, and gives only partial improvement for multi-agent coding. GPU power capping delivers modest but consistent savings, output length produces linear energy growth with little efficiency upside, vLLM sustains higher utilization on decode-heavy work, and Parrot's workflow-aware scheduling achieves lower energy consumption under strict power constraints.

What carries the argument

Four representative workloads together with the three energy-control knobs (batch size, GPU power cap, output length) measured across vLLM and Parrot serving engines.
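
To make the setup concrete, below is a minimal sketch (not the authors' harness) of how such a knob sweep could be driven against a vLLM OpenAI-compatible server while reading cumulative GPU energy through NVML. The endpoint URL, model id, prompts, and sweep values are placeholder assumptions.

```python
"""Sketch of a batch-size / output-length sweep with GPU energy readings.
Assumes a vLLM OpenAI-compatible server is already running and an NVIDIA GPU
(Volta or newer) that exposes the NVML total-energy counter."""
import time
from concurrent.futures import ThreadPoolExecutor

import requests
import pynvml  # pip install nvidia-ml-py

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed vLLM server URL
MODEL = "meta-llama/Llama-2-7b-hf"                 # placeholder model id

pynvml.nvmlInit()
GPU = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu_energy_joules() -> float:
    # Cumulative board energy since driver load, reported in millijoules.
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(GPU) / 1000.0

def run_batch(prompts, max_tokens):
    """Fire the prompts concurrently so the serving engine can batch them."""
    e0, t0 = gpu_energy_joules(), time.time()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        list(pool.map(lambda p: requests.post(
            ENDPOINT, json={"model": MODEL, "prompt": p,
                            "max_tokens": max_tokens}, timeout=600), prompts))
    return gpu_energy_joules() - e0, time.time() - t0  # joules, seconds

prompts = ["Summarize the following report: ..."] * 8  # placeholder workload
for batch_size in (1, 2, 4, 8):
    for max_tokens in (200, 400, 800):
        joules, seconds = run_batch(prompts[:batch_size], max_tokens)
        print(f"batch={batch_size} out={max_tokens}: {joules:.0f} J in {seconds:.1f} s")
# A GPU power cap would be applied outside this script, e.g. `nvidia-smi -pl 250`.
```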

If this is right

  • Workloads whose requests share long common prefixes can cut both energy and latency by raising batch size without extra hardware (a minimal sketch of this pattern follows the list).
  • Sequential summarization pipelines gain almost nothing from batch-size tuning and must rely on other levers such as power capping.
  • GPU power capping gives modest, predictable energy reductions that scale across all four workload types.
  • vLLM-style continuous batching keeps GPU utilization high when decode phases dominate, improving energy per token.
  • Parrot-style workflow-aware scheduling lowers total energy when the system is forced to run under a tight power budget.
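
The shared-prefix case in the first bullet is the one where batching pays off. A minimal sketch of that pattern with vLLM's offline API follows; the model id, prompts, and sampling settings are placeholders rather than the paper's workloads.

```python
"""Sketch, not the paper's workload generator: submit a batch of requests
that share a long common prefix so vLLM's prefix caching and continuous
batching can amortize the prefill work."""
from vllm import LLM, SamplingParams

SHARED_CONTEXT = "Long shared document or system prompt ... " * 100  # placeholder
questions = ["Summarize section 1.", "List the key risks.",
             "Draft an executive summary.", "Extract all dates."]

llm = LLM(model="meta-llama/Llama-2-7b-hf",  # placeholder model id
          enable_prefix_caching=True)         # reuse KV cache for the shared prefix
params = SamplingParams(max_tokens=256, temperature=0.0)

# All four prompts share SHARED_CONTEXT, so batching them together lets the
# engine prefill that prefix once and reuse it, which is where the reported
# energy and latency wins come from.
outputs = llm.generate([SHARED_CONTEXT + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text[:80])
```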

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production serving systems could add lightweight prompt-sharing detectors so they can raise batch size only when the workload will actually benefit (a toy detector is sketched after this list).
  • Similar characterization studies on other accelerators or at the edge would reveal whether the same workload-dependent pattern holds outside the A100 and vLLM/Parrot pair.
  • Workflow schedulers that already track request dependencies could be extended to choose serving engines or power caps on a per-workflow basis rather than globally.
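
As a toy illustration of the first extension, a prompt-sharing detector could be as simple as measuring the common-prefix fraction over queued prompts and raising batch size only above a threshold. The queue, threshold, and decision rule below are invented for illustration, not taken from the paper.

```python
"""Hypothetical prompt-sharing detector; the 0.5 threshold and the decision
rule are illustrative assumptions, not the paper's."""
import os

def shared_prefix_ratio(prompts: list[str]) -> float:
    """Fraction of the average prompt length covered by the common prefix."""
    if len(prompts) < 2:
        return 0.0
    prefix = os.path.commonprefix(prompts)
    avg_len = sum(len(p) for p in prompts) / len(prompts)
    return len(prefix) / avg_len if avg_len else 0.0

def pick_batch_size(prompts, default=1, boosted=8, threshold=0.5):
    # Raise batch size only when the queue actually shares a large prompt,
    # mirroring the finding that batching helps shared-prefix workloads but
    # not sequential summarization.
    return boosted if shared_prefix_ratio(prompts) >= threshold else default

queue = ["<long shared context> question A", "<long shared context> question B"]
print(pick_batch_size(queue))  # -> 8 for this toy queue
```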

Load-bearing premise

The four chosen workloads are representative of the sequential, interactive, agentic, and composite patterns that actually occur in production multi-request LLM deployments.

What would settle it

Re-running the exact same measurement suite on a fresh set of workloads whose prompt-sharing statistics or interaction structure differ markedly from the original four and finding that batch size no longer produces the largest changes in energy or latency.

Figures

Figures reproduced from arXiv: 2604.09611 by Israat Haque, Md. Monzurul Amin Ifath.

Figure 1. Architectural layers of LLM inference …
Figure 2. Representative multi-request LLM workflows in (a) Document chain summarization, (b) LLM-powered search, and (c) Multi-agent coding.
Figure 3. RQ1 results on the LLM-powered search workload. Panels (a-b) sweep output length, (c-d) sweep batch size, and (e-f) sweep GPU power cap, with all other knobs fixed. Each panel uses dual Y-axes: normalized total energy on the left and normalized performance (latency/throughput) on the right. Figure 3a shows the effect of varying output length (200-800 tokens), where increasing the output length significa…
Figure 4. (a) Energy per token vs. batch size under 16 concurrent users. Lines show mean values; shaded regions are 95% confidence intervals. Lower is better. The X markers denote batch sizes that are infeasible (OOM). (b) Prefix cache hit rate distribution across workloads and batch sizes (1–8). Boxplots show medians and variability. Multi-agent coding shows more complex behavior as the lowest energy per token occurs …
Figure 5. RQ3 results on the chain summarization workload showing the impact of input length, output length, batch size, and GPU power cap.
Figure 6. GPU utilization comparison between vLLM and Parrot.
Original abstract

Large language models (LLMs) are increasingly used in applications forming multi-request workflows like document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored. To address these gaps, this paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. We develop four representative workloads capturing sequential, interactive, agentic, and composite patterns common in modern deployments. Using an NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we analyze how key energy knobs affect latency, throughput, and component-level energy use. Our findings reveal batch size as the most impactful lever, though benefits are workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further show that engine-level optimizations in vLLM maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot's workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings offer actionable guidelines for developers and system operators designing performance- and energy-aware LLM serving systems in emerging multi-request workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the first systematic empirical characterization of performance-energy trade-offs in multi-request LLM inference workflows. It introduces four synthetic workloads (sequential summarization, interactive, agentic coding, and composite) evaluated on an NVIDIA A100 testbed with vLLM and Parrot serving systems, reporting that batch size is the dominant optimization lever with workload-dependent benefits, GPU power capping yields modest predictable savings, output length drives linear energy scaling, and engine-specific optimizations affect utilization and efficiency differently under power constraints.

Significance. If the workload representativeness and measurement rigor hold, the work is significant for filling the gap between single-request LLM benchmarks and real multi-request deployments; the concrete identification of batch size as the primary lever and the engine comparisons provide actionable guidelines for energy-aware serving system design in applications like multi-agent systems.

major comments (2)
  1. [Workload Definition] The four workloads are asserted to capture 'sequential, interactive, agentic, and composite patterns common in modern deployments,' yet the manuscript supplies no quantitative validation (e.g., prompt-sharing statistics, request-graph metrics, or output-length distributions) against production traces. This assumption is load-bearing for the headline claim that 'optimal batching benefits workloads with large shared prompts' while being 'ineffective for sequential summarization' and 'only partially effective for multi-agent coding.'
  2. [Results and Evaluation] Directional findings on batch-size impact and workload dependence are reported without error bars, number of repetitions, statistical significance tests, or raw-data release. The absence of these elements makes it impossible to assess whether observed differences (e.g., 'ineffective for sequential summarization') exceed measurement noise, weakening the reliability of the performance-energy trade-off conclusions.
minor comments (1)
  1. [Abstract] The abstract and introduction could more precisely define the energy measurement methodology (e.g., which GPU components are instrumented and the sampling interval) to allow readers to judge the component-level claims.
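
For context, the kind of detail this comment asks for is illustrated by the sketch below, which samples GPU board power through NVML at a fixed interval and reads CPU package energy from Linux RAPL counters. The interval, file path, and component choices are assumptions, not the paper's methodology.

```python
"""Sketch of component-level energy sampling, assuming an NVIDIA GPU with
NVML and a Linux host exposing Intel RAPL counters; interval and path are
illustrative choices, and RAPL counter wraparound is ignored for brevity."""
import time
import pynvml  # pip install nvidia-ml-py

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # CPU package counter

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def sample(duration_s=10.0, interval_s=0.1):
    gpu_joules, cpu_uj0 = 0.0, int(open(RAPL).read())
    t_end = time.time() + duration_s
    while time.time() < t_end:
        # nvmlDeviceGetPowerUsage returns instantaneous board power in mW;
        # integrate it over the sampling interval to approximate energy.
        gpu_joules += pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0 * interval_s
        time.sleep(interval_s)
    cpu_joules = (int(open(RAPL).read()) - cpu_uj0) / 1e6
    return gpu_joules, cpu_joules

print(sample(duration_s=5.0))
```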

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps improve the clarity and rigor of our work. We address each major comment below and will make corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Workload Definition] The four workloads are asserted to capture 'sequential, interactive, agentic, and composite patterns common in modern deployments,' yet the manuscript supplies no quantitative validation (e.g., prompt-sharing statistics, request-graph metrics, or output-length distributions) against production traces. This assumption is load-bearing for the headline claim that 'optimal batching benefits workloads with large shared prompts' while being 'ineffective for sequential summarization' and 'only partially effective for multi-agent coding.'

    Authors: We acknowledge that the workloads are synthetic and that we lack access to proprietary production traces for direct quantitative validation. The workload parameters were derived from patterns and statistics reported in the open literature on multi-agent LLM systems and serving workloads. In the revised manuscript, we will expand Section 3 to include explicit citations to these sources, provide a table comparing our chosen parameters (e.g., prompt-sharing ratios, output-length distributions) to published values, and add a limitations paragraph stating that the results illustrate trade-offs under representative conditions rather than claiming exact replication of any specific production deployment. This addresses the concern while preserving the core contribution. revision: partial

  2. Referee: [Results and Evaluation] Directional findings on batch-size impact and workload dependence are reported without error bars, number of repetitions, statistical significance tests, or raw-data release. The absence of these elements makes it impossible to assess whether observed differences (e.g., 'ineffective for sequential summarization') exceed measurement noise, weakening the reliability of the performance-energy trade-off conclusions.

    Authors: We agree that statistical rigor should be strengthened. Each configuration was executed five times; we will report this number, add error bars (standard deviation) to all figures, include t-test results for the key workload-dependent differences highlighted in the text, and release the raw measurement data via a public repository (with a DOI link added to the camera-ready version). These changes will allow readers to evaluate whether the reported differences exceed measurement variability. revision: yes
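
A minimal sketch of the promised analysis (five repetitions, mean and standard deviation, and a two-sample t-test) is shown below; the energy values are placeholders, not measurements from the paper.

```python
"""Illustrative statistics only; the energy readings are made-up placeholders."""
import numpy as np
from scipy import stats

# Five repetitions of total energy (J) for two batch-size settings.
energy_b1 = np.array([412.0, 405.3, 418.9, 409.7, 414.2])  # placeholder data
energy_b8 = np.array([298.4, 305.1, 301.7, 296.9, 303.3])  # placeholder data

for name, x in [("batch=1", energy_b1), ("batch=8", energy_b8)]:
    print(f"{name}: mean={x.mean():.1f} J, std={x.std(ddof=1):.1f} J")

# Welch's t-test: is the difference larger than run-to-run noise?
t, p = stats.ttest_ind(energy_b1, energy_b8, equal_var=False)
print(f"t={t:.2f}, p={p:.4f}")
```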

Circularity Check

0 steps flagged

No significant circularity: pure empirical measurement study

full rationale

The paper is a characterization study that defines four workloads explicitly as representative examples of sequential, interactive, agentic, and composite patterns, then reports direct measurements of latency, throughput, and energy on an NVIDIA A100 testbed using vLLM and Parrot. No equations, fitted parameters, predictions, or derivations appear that reduce by construction to quantities defined inside the paper. Workload definitions and findings are presented as constructed test cases and observed results rather than self-referential outputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claims rest on external hardware measurements and are therefore self-contained against the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on empirical observations from a specific hardware and software testbed; no mathematical derivations, free parameters, or new postulated entities are introduced.

axioms (1)
  • domain assumption The four workloads are representative of common multi-request patterns in modern deployments
    Invoked when the authors state that the workloads capture sequential, interactive, agentic, and composite patterns.

pith-pipeline@v0.9.0 · 5588 in / 1245 out tokens · 33458 ms · 2026-05-15T12:25:51.733428+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 6 internal anchors

  1. [1]

    https://fastapi.tiangolo.com/, 2025

    Fastapi. https://fastapi.tiangolo.com/, 2025

  2. [2]

    https://github.com/langchain-ai/langchain, 2025

    Langchain. https://github.com/langchain-ai/langchain, 2025

  3. [3]

    https://www.llamaindex.ai/, 2025

    Llamaindex - build knowledge assistants over your enterprise data. https://www.llamaindex.ai/, 2025

  4. [4]

    On evaluating performance of llm inference serving systems.arXiv preprint arXiv:2507.09019, 2025

    Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, and Alexey Tumanov. On evaluating performance of llm inference serving systems.arXiv preprint arXiv:2507.09019, 2025

  5. [5]

    Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, 2024

  6. [6]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023

  7. [7]

    Llm in a flash: Efficient large language model inference with limited memory

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12584, 2024

  8. [8]

    Amd smi 26.0.2 documentation, 2025

    AMD. Amd smi 26.0.2 documentation, 2025

  9. [9]

    Measuring and improving the energy efficiency of large language models inference.IEEE Access, 12:80194–80207, 2024

    Mauricio Fadel Argerich and Marta Patiño-Martínez. Measuring and improving the energy efficiency of large language models inference.IEEE Access, 12:80194–80207, 2024

  10. [10]

    A programming framework for agentic ai, 2025

    AutoGen. A programming framework for agentic ai, 2025

  11. [11]

    Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning.Advances in Neural Information Processing Systems, 37:12461–12495, 2024

    Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning.Advances in Neural Information Processing Systems, 37:12461–12495, 2024

  12. [12]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  13. [13]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  14. [14]

    Murakkab: Resource-efficient agentic workflow orchestration in cloud platforms.arXiv preprint arXiv:2508.18298, 2025

    Gohar Irfan Chaudhry, Esha Choukse, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, and Ricardo Bianchini. Murakkab: Resource-efficient agentic workflow orchestration in cloud platforms.arXiv preprint arXiv:2508.18298, 2025

  15. [15]

    Kairos: Low-latency multi-agent serving with shared llms and excessive loads in the public cloud.arXiv preprint arXiv:2508.06948, 2025

    Jinyuan Chen, Jiuchen Shi, Quan Chen, and Minyi Guo. Kairos: Low-latency multi-agent serving with shared llms and excessive loads in the public cloud.arXiv preprint arXiv:2508.06948, 2025

  16. [16]

    Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neural Information Processing Systems, 37:45767–45790, 2024

    Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A Zaharia, and James Y Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neural Information Processing Systems, 37:45767–45790, 2024

  17. [17]

    Llm-inference-bench: Inference benchmarking of large language models on ai accelerators

    Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. Llm-inference-bench: Inference benchmarking of large language models on ai accelerators. InSC24-W: Workshops of the International Conference for High Performance Computing, Networking, Stor...

  18. [18]

    Nvidia a100 tensor core gpu: Performance and innovation

    Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro, 41(2):29–35, 2021

  19. [19]

    de Araújo, JPW, and MinervaBooks

    Benoit Courty, Victor Schmidt, Sasha Luccioni, Goyal-Kamal, MarionCoutarel, Boris Feld, Jérémy Lecourt, LiamConnell, Amine Saboni, Inimaz, supatomic, Mathilde Léval, Luis Blanche, Alexis Cruveiller, ouminasara, Franklin Zhao, Aditya Joshi, Alexis Bogroff, Hugues de Lavoreille, Niko ...

  20. [20]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  21. [21]

    Rapl: Memory power estimation and capping

    Howard David, Eugene Gorbatov, Ulf R Hanebutte, Rahul Khanna, and Christian Le. Rapl: Memory power estimation and capping. InProceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, pages 189–194, 2010

  22. [22]

    Energy considerations of large language model inference and efficiency optimizations

    Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luccioni, and Emma Strubell. Energy considerations of large language model inference and efficiency optimizations. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V...

  23. [23]

    Estimation of energy consumption in machine learning.Journal of Parallel and Distributed Computing, 134:75–88, 2019

    Eva García-Martín, Crefeda Faviola Rodrigues, Graham Riley, and Håkan Grahn. Estimation of energy consumption in machine learning.Journal of Parallel and Distributed Computing, 134:75–88, 2019

  24. [24]

    Green ai: Do deep learning frameworks have different costs? In Proceedings of the 44th International Conference on Software Engineering, pages 1082–1094, 2022

    Stefanos Georgiou, Maria Kechagia, Tushar Sharma, Federica Sarro, and Ying Zou. Green ai: Do deep learning frameworks have different costs? In Proceedings of the 44th International Conference on Software Engineering, pages 1082–1094, 2022

  25. [25]

    ggml-org/llama.cpp: Llm inference in c/c++, 2025

    ggml-org. ggml-org/llama.cpp: Llm inference in c/c++, 2025

  26. [26]

    Gemma open models, 2025

    Google. Gemma open models, 2025

  27. [27]

    Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671, 2024

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671, 2024

  28. [28]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InThe Twelfth International Conference on Learning Representations, 2023

  29. [29]

    Wattsonai: Measuring, analyzing, and visualizing energy and carbon footprint of ai workloads.arXiv preprint arXiv:2506.20535, 2025

    Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, and Guoming Tang. Wattsonai: Measuring, analyzing, and visualizing energy and carbon footprint of ai workloads.arXiv preprint arXiv:2506.20535, 2025

  30. [30]

    huggingface/text-generation-inference: Large language model text generation inference, 2025

    huggingface. huggingface/text-generation-inference: Large language model text generation inference, 2025

  31. [31]

    meta-llama/llama-2-7b, 2025

    HuggingFace. meta-llama/llama-2-7b, 2025

  32. [32]

    Energy efficiency 2024, 2024

    International Energy Agency (IEA). Energy efficiency 2024, 2024. Licence: CC BY 4.0

  33. [33]

    Energy efficiency 2025, 2025

    International Energy Agency (IEA). Energy efficiency 2025, 2025

  34. [34]

    ml-energy/zeus: Measure and optimize the energy consumption of your ai applications!, 2025

    Jie You and Jae-Won Chung and Mosharaf Chowdhury. ml-energy/zeus: Measure and optimize the energy consumption of your ai applications!, 2025

  35. [35]

    Slo-aware gpu dvfs for energy-efficient llm inference serving.IEEE Computer Architecture Letters, 23(2):150–153, 2024

    Andreas Kosmas Kakolyris, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris. Slo-aware gpu dvfs for energy-efficient llm inference serving.IEEE Computer Architecture Letters, 23(2):150–153, 2024

  36. [36]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  37. [37]

    Energy consumption of neural networks on nvidia edge boards: an empirical model

    Seyyidahmed Lahmer, Aria Khoshsirat, Michele Rossi, and Andrea Zanella. Energy consumption of neural networks on nvidia edge boards: an empirical model. In2022 20th international symposium on modeling and optimization in mobile, ad hoc, and wireless networks (WiOpt), pages 365–371. IEEE, 2022

  38. [38]

    Llm inference serving: Survey of recent advances and opportunities

    Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Llm inference serving: Survey of recent advances and opportunities. In2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE, 2024

  39. [39]

    Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering.arXiv preprint arXiv:2304.12102, 2023

    Yucheng Li. Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering.arXiv preprint arXiv:2304.12102, 2023

  40. [40]

    AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, 2023

  41. [41]

    Parrot: Efficient serving of LLM-based applications with semantic variable

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of LLM-based applications with semantic variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 929–945, 2024

  42. [42]

    Bullet: Boosting gpu utilization for llm serving via dynamic spatial-temporal orchestration.arXiv preprint arXiv:2504.19516, 2025

    Zejia Lin, Hongxin Xu, Guanyi Chen, Xianwei Zhang, and Yutong Lu. Bullet: Boosting gpu utilization for llm serving via dynamic spatial-temporal orchestration.arXiv preprint arXiv:2504.19516, 2025

  43. [43]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172, 2023

  44. [44]

    Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965, 2025

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965, 2025

  45. [45]

    Investigating energy efficiency and performance trade-offs in llm inference across tasks and dvfs settings.arXiv preprint arXiv:2501.08219, 2025

    Paul Joe Maliakel, Shashikant Ilager, and Ivona Brandic. Investigating energy efficiency and performance trade-offs in llm inference across tasks and dvfs settings.arXiv preprint arXiv:2501.08219, 2025

  46. [46]

    The multi-agent framework, 2025

    MetaGPT. The multi-agent framework, 2025

  47. [47]

    Microsoft 365 copilot, 2025

    Microsoft. Microsoft 365 copilot, 2025

  48. [48]

    microsoft/parrotserve: [osdi’24] serving llm-based applications efficiently with semantic variable, 2025

    Microsoft. microsoft/parrotserve: [osdi’24] serving llm-based applications efficiently with semantic variable, 2025

  49. [49]

    Large Language Models: A Survey

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey.arXiv preprint arXiv:2402.06196, 2024

  50. [50]

    System management interface smi, 2025

    NVIDIA Developer. System management interface smi, 2025

  51. [51]

    Get up and running with openai gpt-oss, deepseek-r1, gemma 3 and other models., 2025

    Ollama. Get up and running with openai gpt-oss, deepseek-r1, gemma 3 and other models., 2025

  52. [52]

    vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for llms, 2025

    Open Collective. vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for llms, 2025

  53. [53]

    vllm v1 user guide — vllm, 2025

    Open Collective. vllm v1 user guide — vllm, 2025

  54. [54]

    Characterizing power management opportunities for llms in the cloud

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. Characterizing power management opportunities for llms in the cloud. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 207–222, 2024

  55. [55]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  56. [56]

    Tu (r) ning ai green: Exploring energy efficiency cascading with orthogonal optimizations

    Saurabhsingh Rajput, Mootez Saad, and Tushar Sharma. Tu (r) ning ai green: Exploring energy efficiency cascading with orthogonal optimizations. arXiv preprint arXiv:2506.18289, 2025

  57. [57]

    Enhancing energy-awareness in deep learning through fine-grained energy measurement.ACM Transactions on Software Engineering and Methodology, 33(8):1–34, 2024

    Saurabhsingh Rajput, Tim Widmayer, Ziyuan Shang, Maria Kechagia, Federica Sarro, and Tushar Sharma. Enhancing energy-awareness in deep learning through fine-grained energy measurement.ACM Transactions on Software Engineering and Methodology, 33(8):1–34, 2024

  58. [58]

    From words to watts: Benchmarking the energy costs of large language model inference

    Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. From words to watts: Benchmarking the energy costs of large language model inference. In2023 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2023

  59. [59]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  60. [60]

    sgl-project/sglang: Sglang is a fast serving framework for large language models and vision language models., 2025

    sglang.ai. sgl-project/sglang: Sglang is a fast serving framework for large language models and vision language models., 2025

  61. [61]

    Sharegpt dataset, 2025

    ShareGPT team. Sharegpt dataset, 2025

  62. [62]

    The sparsely-gated mixture-of-experts layer.Outrageously large neural networks, 2, 2017

    N Shazeer, A Mirhoseini, K Maziarz, A Davis, Q Le, G Hinton, and J Dean. The sparsely-gated mixture-of-experts layer.Outrageously large neural networks, 2, 2017

  63. [63]

    On the sustainability of ai inferences in the edge.arXiv preprint arXiv:2507.23093, 2025

    Ghazal Sobhani, Md Monzurul Amin Ifath, Tushar Sharma, and Israat Haque. On the sustainability of ai inferences in the edge.arXiv preprint arXiv:2507.23093, 2025

  64. [64]

    Towards greener llms: Bringing energy-efficiency to the forefront of llm inference

    Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Íñigo Goiri, and Josep Torrellas. Towards greener llms: Bringing energy-efficiency to the forefront of llm inference. InEMC2 at ASPLOS, April 2024

  65. [65]

    Dynamollm: Designing llm inference clusters for performance and energy efficiency

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1348–1362. IEEE, 2025

  66. [66]

    Efficient and workload-aware llm serving via runtime layer swapping and kv cache resizing.arXiv preprint arXiv:2506.02006, 2025

    Zhaoyuan Su, Tingfeng Lan, Zirui Wang, Juncheng Yang, and Yue Cheng. Efficient and workload-aware llm serving via runtime layer swapping and kv cache resizing.arXiv preprint arXiv:2506.02006, 2025

  67. [67]

    Llumnix: Dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 173–191, 2024

  68. [68]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  69. [69]

    Unveiling energy efficiency in deep learning: Measurement, prediction, and scoring across edge devices

    Xiaolong Tu, Anik Mallik, Dawei Chen, Kyungtae Han, Onur Altintas, Haoxin Wang, and Jiang Xie. Unveiling energy efficiency in deep learning: Measurement, prediction, and scoring across edge devices. InProceedings of the Eighth ACM/IEEE Symposium on Edge Computing, pages 80–93, 2023

  70. [70]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  71. [71]

    Offline energy-optimal llm serving: Workload-based energy models for llm inference on heterogeneous systems.ACM SIGENERGY Energy Informatics Review, 4(5):113–119, 2024

    Grant Wilkins, Srinivasan Keshav, and Richard Mortier. Offline energy-optimal llm serving: Workload-based energy models for llm inference on heterogeneous systems.ACM SIGENERGY Energy Informatics Review, 4(5):113–119, 2024

  72. [72]

    Sustainable ai: Environmental implications, challenges and opportunities.Proceedings of machine learning and systems, 4:795–813, 2022

    Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities.Proceedings of machine learning and systems, 4:795–813, 2022

  73. [73]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework.arXiv preprint arXiv:2308.08155, 3(4), 2023

  74. [74]

    Zeus: Understanding and optimizing GPU energy consumption of DNN training

    Jie You, Jae-Won Chung, and Mosharaf Chowdhury. Zeus: Understanding and optimizing GPU energy consumption of DNN training. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 119–139, 2023

  75. [75]

    Orca: A distributed serving system for Transformer-based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022

  76. [76]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  77. [77]

    Llm-enhanced data management.arXiv preprint arXiv:2402.02643, 2024

    Xuanhe Zhou, Xinyang Zhao, and Guoliang Li. Llm-enhanced data management. arXiv preprint arXiv:2402.02643, 2024

  78. [78]

    Language agents as optimizable graphs.arXiv preprint arXiv:2402.16823, 2024

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs.arXiv preprint arXiv:2402.16823, 2024