KernelSight-LM: A Kernel-Level LLM Inference Simulator
Pith reviewed 2026-06-30 00:44 UTC · model grok-4.3
The pith
KernelSight-LM predicts per-kernel LLM inference latency on unseen GPU generations to 12.1% error with no target measurements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KernelSight-LM decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that captures prefix caching and continuous batching. The cross-generation tier uses only hardware specifications and kernel microbenchmarks from previously profiled GPUs to predict per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline of 22.0%. The target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU to reach 3.8% per-kernel error.
What carries the argument
Roofline kernel model with a learned efficiency term fitted from microbenchmarks, composed with communication and host-overhead models via a discrete-event scheduler.
If this is right
- End-to-end median errors reach 15.4% TTFT, 12.8% TPOT, and 3.0% throughput in cross-generation mode across six model families.
- Target-measured mode sharpens per-kernel error to 3.8% with a single model-agnostic microbenchmark sweep.
- Kernel-level bottleneck breakdowns directly support hardware and software co-design decisions.
- Both tiers require far less target-GPU data collection than prior profiling systems they extend.
Where Pith is reading between the lines
- The same decomposition could be tested on non-GPU accelerators if the roofline-plus-efficiency structure holds.
- Capacity planning tools could ingest the kernel breakdowns to estimate cluster sizing before hardware purchase.
- Serving policy designers could iterate on batching and caching rules using simulated traces instead of repeated deployments.
Load-bearing premise
The efficiency term learned from microbenchmarks on previously profiled GPUs generalizes to new GPU generations and diverse serving workloads without target-specific measurements or retraining.
What would settle it
Running the cross-generation tier on a new GPU generation and observing per-kernel prediction error that exceeds 12.1% or fails to beat the 22.0% roofline baseline by a substantial margin would falsify the central claim.
Figures
read the original abstract
As large language models (LLMs) move into production serving, practitioners must rapidly evaluate inference performance across diverse hardware, models, and serving parameters to meet cost and latency targets. However, the end-to-end behavior of LLMs couples serving-layer policies with low-level GPU kernel execution and rapidly evolving architectures, forcing slow, deployment-specific benchmarking that is hard to generalize. We present KernelSight-LM, a fine-grained inference simulator that models token-level execution and produces kernel-level latency breakdowns. It decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that also captures mechanisms like prefix caching and continuous batching. KernelSight-LM offers two prediction tiers that trade target-GPU data for accuracy. The cross-generation tier uses no target-GPU measurements, only hardware specifications and kernel microbenchmarks from previously profiled GPUs, and predicts per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline (22.0%). A second target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, sharpening per-kernel error to 3.8%, a 7.3x improvement over a comparable baseline (27.7%). Both tiers require far less target-GPU data than the prior systems they extend. In our simulator, these predictions yield end-to-end median (p50) errors across six model families of 15.4%, 12.8%, and 3.0% (TTFT, TPOT, throughput) in the cross-generation tier and 14.3%, 6.2%, and 2.7% in the target-measured tier, matching dedicated profiling tools while collecting far less on-device data. Beyond prediction, its kernel-level bottleneck breakdowns support hardware/software co-design and capacity planning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents KernelSight-LM, a simulator for LLM inference performance that models token-level execution by decomposing each serving step into a roofline-based kernel model augmented with a learned efficiency term, a communication model, and a host-overhead model. These are composed using a discrete-event scheduler that incorporates serving mechanisms such as prefix caching and continuous batching. The system provides two prediction tiers: a cross-generation tier that uses only hardware specifications and prior GPU microbenchmarks to predict per-kernel latencies on unseen GPU generations with 12.1% error (1.8× improvement over a 22.0% roofline baseline), and a target-measured tier that adds one model-agnostic microbenchmark sweep to achieve 3.8% error. End-to-end median errors across six model families are reported as 15.4%, 12.8%, and 3.0% for TTFT, TPOT, and throughput in the cross-generation tier, and 14.3%, 6.2%, and 2.7% in the target-measured tier.
Significance. If the generalization of the learned efficiency term holds, KernelSight-LM would provide a valuable tool for evaluating LLM inference across diverse hardware with minimal target-specific data collection, supporting hardware/software co-design and capacity planning in production serving environments. The kernel-level bottleneck breakdowns are a particular strength for identifying optimization opportunities. The paper is credited for extending roofline models with a learned term and discrete-event simulation to capture serving dynamics, and for reporting concrete quantitative improvements over baselines.
major comments (3)
- [Abstract] The abstract claims a 12.1% per-kernel error for the cross-generation tier using 'no target-GPU measurements' and 'kernel microbenchmarks from previously profiled GPUs'. No information is given on the number of source GPU generations used to derive the learned efficiency term, the exact fitting procedure, or validation across multiple target generations. This is load-bearing for the central claim of 1.8× improvement over the roofline baseline, as the error may reflect tuning rather than portable generalization.
- [§3 (Model Description)] The decomposition into roofline kernel, communication, and host-overhead models with the learned efficiency term is described, but the manuscript does not clarify if the efficiency term is a single scalar or incorporates additional parameters, nor how it is fitted without target data. This directly affects whether the cross-generation predictions are independent of the target architecture.
- [§5 (Evaluation)] The end-to-end results (15.4% TTFT, 12.8% TPOT, 3.0% throughput for cross-generation) are presented across six model families, but the section supplies no experimental setup details, baseline implementations, validation splits, or confirmation that the learned term was not fitted in a manner that inflates the cross-generation performance. This undermines assessment of the reported errors.
minor comments (2)
- [Abstract] The specific model families used in the evaluation are not named; providing their identities would aid reproducibility and context.
- [Throughout] The notation and definitions for the learned efficiency term and its integration into the roofline model could be more explicitly introduced early in the text to improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight areas where additional clarity on the cross-generation methodology and experimental setup would strengthen the manuscript. We address each major comment below and will incorporate revisions to provide the requested details without altering the core claims or results.
read point-by-point responses
-
Referee: [Abstract] The abstract claims a 12.1% per-kernel error for the cross-generation tier using 'no target-GPU measurements' and 'kernel microbenchmarks from previously profiled GPUs'. No information is given on the number of source GPU generations used to derive the learned efficiency term, the exact fitting procedure, or validation across multiple target generations. This is load-bearing for the central claim of 1.8× improvement over the roofline baseline, as the error may reflect tuning rather than portable generalization.
Authors: We agree the abstract would benefit from explicit mention of the source data scope to support the generalization claim. Section 3 of the manuscript details the use of microbenchmarks from multiple prior GPU generations to fit the efficiency term via regression, with validation on unseen target generations. We will revise the abstract to briefly note the source GPU generations and that fitting uses only prior data, preserving the reported 12.1% error and 1.8× improvement as measured on held-out targets. revision: yes
-
Referee: [§3 (Model Description)] The decomposition into roofline kernel, communication, and host-overhead models with the learned efficiency term is described, but the manuscript does not clarify if the efficiency term is a single scalar or incorporates additional parameters, nor how it is fitted without target data. This directly affects whether the cross-generation predictions are independent of the target architecture.
Authors: The efficiency term is implemented as a single scalar per kernel type, fitted exclusively on source GPU microbenchmarks by regressing the ratio of measured to roofline-predicted latency. No target-GPU data enters the fit, ensuring independence. We will add an explicit sentence in §3.2 stating the scalar nature and confirming the fitting procedure uses only prior-generation data. revision: yes
-
Referee: [§5 (Evaluation)] The end-to-end results (15.4% TTFT, 12.8% TPOT, 3.0% throughput for cross-generation) are presented across six model families, but the section supplies no experimental setup details, baseline implementations, validation splits, or confirmation that the learned term was not fitted in a manner that inflates the cross-generation performance. This undermines assessment of the reported errors.
Authors: We acknowledge that §5 would be strengthened by expanded setup details. The evaluation uses source-only fitting of the efficiency term with a validation split on unseen targets and generations; baselines are pure roofline models without the learned term. We will revise §5 to include the validation splits, baseline implementations, and explicit confirmation that fitting excludes all target data, ensuring the reported errors reflect true cross-generation performance. revision: yes
Circularity Check
No significant circularity; cross-generation prediction tests generalization of fitted efficiency term on unseen GPUs.
full rationale
The paper's core claim decomposes inference into a roofline model plus a learned efficiency term fitted exclusively on microbenchmarks from previously profiled GPUs, then applies the term (with only target hardware specifications) to predict latencies on entirely unseen GPU generations. This is a standard held-out generalization test rather than a self-definitional loop or a fitted parameter renamed as a prediction on the same data. No equations or self-citations are shown reducing the reported 12.1% error to the input microbenchmarks by construction; the baseline comparison (pure roofline) further isolates the contribution of the fitted term as an independent modeling choice. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned efficiency term
axioms (1)
- domain assumption LLM inference execution can be decomposed into independent roofline kernel, communication, and host-overhead models whose composition via discrete-event scheduling reproduces end-to-end behavior.
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, et al . 2024. Phi-4 Technical Report. arXiv:2412.08905 [cs.CL] doi:10.48550/arXiv.2412.08905
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.08905 2024
-
[2]
Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Vidur: A Large-Scale Simulation Framework for Llm Inference.Proceedings of Machine Learning and Systems6 (2024), 351–366
2024
-
[3]
Gulavani, Alexey Tumanov, and Ramachandran Ramjee
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwa- tra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310 [cs.LG] doi:10.48550/arXiv.2403.02310
-
[4]
Gulavani, and Ramachandran Ramjee
Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG] doi:10. 48550/arXiv.2308.16369
Pith/arXiv arXiv 2023
-
[5]
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training Generalized Multi-Query Trans- former Models from Multi-Head Checkpoints.arXiv preprint arXiv:2305.13245 (2023). arXiv:2305.13245
Pith/arXiv arXiv 2023
-
[6]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, et al. 2024. Deepseek Llm: Scaling Open-Source Language Models with Longtermism.arXiv preprint arXiv:2401.02954(2024). arXiv:2401.02954
Pith/arXiv arXiv 2024
-
[7]
Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. 2020. Language Models Are Few-Shot Learners. InProceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, 1877–1901
2020
-
[8]
Jaehong Cho, Hyunmin Choi, and Jongse Park. 2025. LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure.IEEE Computer Architecture Letters24, 2 (July 2025), 361–364. arXiv:2511.07229 [cs] doi:10.1109/LCA.2025.3628325
-
[9]
Tri Dao. 2023. Flashattention-2: Faster Attention with Better Parallelism and Work Partitioning.arXiv preprint arXiv:2307.08691(2023). arXiv:2307.08691
Pith/arXiv arXiv 2023
-
[10]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG] doi:10.48550/arXiv.2205.14135
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2205.14135 2022
-
[11]
Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, et al. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. arXiv:2403.19708 [cs.CL] doi:10.48550/arXiv.2403.19708
-
[12]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. arXiv:2311.04934 [cs.CL] doi:10.48550/arXiv.2311.04934
-
[13]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, et al. 2024. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783(2024). arXiv:2407.21783
Pith/arXiv arXiv 2024
-
[14]
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, et al. 2023. Text- books Are All You Need. arXiv:2306.11644 [cs.CL] doi:10.48550/arXiv.2306.11644
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.11644 2023
-
[15]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guant- ing Chen, Xiao Bi, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] doi:10.48550/arXiv.2401.14196
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196 2024
-
[16]
Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai, Xuefei Ning, Shengen Yan, Yun Liang, et al. 2025. SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling. InEighth Conference on Machine Learning and Systems
2025
-
[17]
Hugging Face. 2023. Text Generation Inference
2023
-
[18]
Saki Imai, Rina Nakazawa, Marcelo Amaral, Sunyanan Choochotkaew, and Tat- suhiro Chiba. 2024. Predicting LLM Inference Latency: A Roofline-Driven ML Method. InAnnual Conference on Neural Information Processing Systems
2024
-
[19]
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826 [cs.DC] doi:10.48550/arXiv.1804.06826
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1804.06826 2018
-
[20]
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al
-
[21]
Mistral 7B. arXiv:2310.06825 [cs.CL] doi:10.48550/arXiv.2310.06825
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825
-
[22]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, et al. 2023. Efficient Memory Man- agement for Large Language Model Serving with Pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626
2023
-
[23]
Seonho Lee, Amar Phanishayee, and Divya Mahajan. 2025. Forecasting GPU Performance for Deep Learning Training and Inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 493–508. arXiv:2407.13853 [cs] doi:10.1145/3669940.3707265
-
[24]
Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient Generative Inference of Large Language Models with Dynamic {KV} Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172
2024
-
[25]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive Nlp Tasks.Advances in neural information processing systems33 (2020), 9459–9474
2020
-
[26]
Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, and Fanny Nina Paravecino. 2025. APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving. arXiv:2411.17651 [cs] doi:10.48550/arXiv.2411.17651
-
[27]
Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien- Ming Huang. 2023. Llm-Powered Conversational Voice Assistants: Interaction Patterns, Opportunities, Challenges, and Design Guidelines.arXiv preprint arXiv:2309.13879(2023). arXiv:2309.13879
arXiv 2023
-
[28]
Microsoft, Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, et al. 2025. Phi-4-Mini Tech- nical Report: Compact yet Powerful Multimodal Language Models via Mixture- of-LoRAs. arXiv:2503.01743 [cs.CL] doi:10.48550/arXiv.2503.01743
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.01743 2025
-
[29]
NVIDIA Corporation. 2023. Matrix Multiplication Background User’s Guide. NVIDIA Deep Learning Performance Documentation, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix- multiplication/index.html. Accessed: 2026-06-25
2023
-
[30]
NVIDIA Corporation. 2023. TensorRT-LLM: NVIDIA’s Inference Optimization Library
2023
-
[31]
NVIDIA Corporation. 2024. NCCL: Optimized Primitives for Collective Multi- GPU Communication. https://github.com/NVIDIA/nccl. Accessed: 2026-06-25
2024
-
[32]
NVIDIA Corporation. 2024. NVIDIA CUDA Toolkit, Version 12.x
2024
-
[33]
Pitch Patarasuk and Xin Yuan. 2009. Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.J. Parallel and Distrib. Comput.69, 2 (Feb. 2009), 117–124. doi:10.1016/j.jpdc.2008.09.002
-
[34]
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Infer- ence Using Phase Splitting. In2024 ACM/IEEE 51st Annual International Sympo- sium on Computer Architecture (ISCA). IEEE, Buenos Aires, Argentina, 118–132. doi:10.1109/ISCA59077.2024.00019 arXiv Prepr...
-
[35]
Sparks, and Ameet Talwalkar
Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance Model for Deep Neural Networks. InInternational Conference on Learning Repre- sentations
2017
-
[36]
Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC] doi:10.48550/arXiv.2407. 00079
-
[37]
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, et al. 2025. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] doi:10.48550/arXiv.2412.15115
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
-
[38]
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna
-
[39]
In2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. In2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 81–92. doi:10.1109/ISPASS48437.2020.00018
-
[40]
James Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel. 2022. Torch. Fx: Practical Program Capture and Transformation for Deep Learning in Python.Proceedings of Machine Learning and Systems4 (2022), 638–651
2022
-
[41]
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.Advances in Neural Information Processing Systems37 (2024), 68658–68685
2024
-
[42]
Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head Is All You Need. arXiv:1911.02150 [cs.NE] doi:10.48550/arXiv.1911.02150
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1911.02150 2019
-
[43]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] doi:10. 48550/arXiv.1909.08053
Pith/arXiv arXiv 2020
-
[44]
Wei Sun, Ang Li, Tong Geng, Sander Stuijk, and Henk Corporaal. 2023. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. IEEE Transactions on Parallel and Distributed Systems34, 1 (Jan. 2023), 246–261. arXiv:2206.02874 [cs.AR] doi:10.1109/TPDS.2022.3217824
-
[45]
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cas- sidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, et al
-
[46]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 [cs.CL] doi:10.48550/arXiv.2408.00118
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.00118
-
[47]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs] doi:10.48550/arXiv.1706.03762
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023
-
[48]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures.Commun. ACM 52, 4 (April 2009), 65–76. doi:10.1145/1498765.1498785
-
[49]
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. In2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 283–294. arXiv:2303.14006 [cs] doi:10.1109/ISPA...
-
[50]
Feiyang Wu, Zhuohang Bian, Guoyang Duan, Tianle Xu, Junchi Wu, Teng Ma, Yongqiang Yao, Ruihao Gong, et al . 2025. TokenSim: Enabling Hard- ware and Software Exploration for Large Language Model Inference Systems. arXiv:2503.08415 [cs] doi:10.48550/arXiv.2503.08415
-
[51]
Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, Yi Shen, et al. 2026. AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving. arXiv:2601.06288 [cs.LG] doi:10. 48550/arXiv.2601.06288
arXiv 2026
-
[52]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, et al . 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] doi:10.48550/arXiv.2505.09388
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
-
[53]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538. https://www.usenix.org/conference/osdi22/ presentation/yu
2022
-
[54]
Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko
Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habi- tat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. In2021 USENIX Annual Technical Conference (USENIX ATC 21). 503–521
2021
-
[55]
Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. Nn-Meter: Towards Accurate Latency Prediction of Deep- Learning Model Inference on Diverse Edge Devices. InProceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. ACM, Virtual Event Wisconsin, 81–93. doi:10.1145...
-
[56]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, et al . 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] doi:10.48550/arXiv.2312.07104
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.07104 2024
-
[57]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv:2401.09670 [cs.DC] doi:10.48550/arXiv.2401.09670 KernelSight-LM: A Kernel-Level LLM Inference Simulator arXiv Preprint, June, 2026, A Experimental Se...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.