pith. machine review for the scientific record.

arxiv: 2605.07985 · v1 · submitted 2026-05-08 · 💻 cs.DC · cs.AI

Recognition: no theorem link

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Anoop Rachakonda, Daehyeok Kim, Geon-Woo Kim, Joon Ha Kim

Pith reviewed 2026-05-11 02:37 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords LLM inference · profiling · simulation · taint propagation · latency modeling · redundancy reduction · configuration exploration

The pith

Dooly reuses profiled LLM operation latencies across model configurations by tracing input-dimension origins in a single inference pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dooly tackles the repeated full profiling that current simulators require for each new combination of model, hardware, and attention backend. It executes one inference run, applies taint propagation to mark whether each operation input comes from fixed model settings or from the request, and profiles only the operations absent from its growing latency store. Regression models fitted to that store then replace the profiling step inside existing simulators. Across twelve models, two GPU platforms, and three backends, the resulting predictions stay within 5 percent MAPE on time-to-first-token and 8 percent on time-per-output-token while using 56.4 percent fewer GPU-hours than exhaustive per-configuration profiling.
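
To make the dimension-origin idea concrete, here is a minimal editorial sketch, not taken from the paper: dimensions fixed by the model configuration form a literal reuse key, while request-dependent dimensions are left free to be swept at profiling time. `Origin` and `op_signature` are hypothetical names, not Dooly's API.

```python
# Illustrative sketch only: turning dimension-origin labels into a
# configuration-agnostic reuse key for one profiled operation.
from enum import Enum

class Origin(Enum):
    CONFIG = "config"    # fixed by the model configuration (e.g. hidden size)
    REQUEST = "request"  # determined by the request (e.g. prompt token count)

def op_signature(op_name, dim_labels, dim_values):
    """Config-fixed dims stay literal; request-dependent dims become wildcards."""
    key = tuple(str(v) if lbl is Origin.CONFIG else "*"
                for lbl, v in zip(dim_labels, dim_values))
    return (op_name, key)

# A linear layer seen during the single trace: 512 prompt tokens, hidden size 4096.
sig = op_signature("aten::linear",
                   [Origin.REQUEST, Origin.CONFIG, Origin.CONFIG],
                   [512, 4096, 4096])
# sig == ("aten::linear", ("*", "4096", "4096")): any model whose configuration
# yields the same config-fixed dims can reuse latencies recorded under this key,
# with only the request-dependent dimension swept at profiling time.
```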

Core claim

Dooly performs a single inference pass with taint propagation to label each operation's input dimensions by their origin in the model configuration or the incoming request. It selectively profiles only those operations absent from a growing latency database, reuses the serving engine's initialization for stateful operations like attention, and constructs regression models from the collected data. These models act as a drop-in replacement for the profiling component in existing simulators, delivering mean absolute percentage errors below 5% for time-to-first-token and below 8% for time-per-output-token while requiring 56.4% fewer profiling GPU-hours across diverse models, platforms, and backends.

What carries the argument

Taint propagation during one inference execution to classify each input dimension's origin, enabling selective profiling and reuse of a latency database plus regression models for unprofiled cases.
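
A schematic of that reuse loop, under the assumption that each operation is keyed by a configuration-agnostic signature as sketched above; `LatencyDB`, `profile_missing`, and `profile_op` are illustrative names only, not the paper's interfaces.

```python
# Hedged sketch of selective profiling against a shared latency store:
# profile an operation only when its signature is missing from the database.

class LatencyDB:
    def __init__(self):
        self._store = {}  # signature -> list of (request_dims, latency_seconds)

    def has(self, signature):
        return signature in self._store

    def record(self, signature, request_dims, latency_seconds):
        self._store.setdefault(signature, []).append((request_dims, latency_seconds))

def profile_missing(ops, db, profile_op):
    """ops: iterable of (signature, sweep of request-dependent dimension tuples)."""
    newly_profiled = 0
    for signature, sweep in ops:
        if db.has(signature):
            continue  # already measured under an earlier configuration: reuse it
        for request_dims in sweep:
            db.record(signature, request_dims, profile_op(signature, request_dims))
        newly_profiled += 1
    return newly_profiled  # fewer new profiles as the database accumulates models
```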

If this is right

  • Existing simulators can adopt the latency database without re-profiling for each new model configuration.
  • Profiling effort decreases as the database accumulates data from multiple models due to shared dimensions.
  • Prediction accuracy remains high across GPU platforms, attention backends, and model architectures without additional tuning.
  • Stateful operations are handled automatically by reusing engine initialization code rather than custom instrumentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This reuse pattern could lower the barrier to testing many serving configurations during research and production tuning.
  • Similar dimension-origin tracking might apply to profiling other neural-network inference workloads beyond transformers.
  • A shared community latency database could further amortize costs across independent users.

Load-bearing premise

A single taint-labeled inference pass is enough to identify every reusable operation across arbitrary model configurations, and the resulting latency database plus regression models generalize without hidden dependencies.

What would settle it

Run Dooly on a previously unseen model configuration, feed its latency database to a simulator, and compare the predicted TTFT and TPOT against direct hardware measurements; any deviation exceeding 8 percent MAPE on TPOT would falsify the claim.
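
For reference, the acceptance threshold in that test is a plain mean absolute percentage error; a minimal check could look like the following sketch (hypothetical helper, not from the paper).

```python
# Minimal form of the pass/fail check described above.
def mape(measured, predicted):
    """Mean absolute percentage error, in percent, over lists of latencies."""
    assert measured and len(measured) == len(predicted)
    return 100.0 * sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted)) / len(measured)

# The claim fails on an unseen configuration if, e.g.,
# mape(measured_tpot, predicted_tpot) > 8.0 or mape(measured_ttft, predicted_ttft) > 5.0.
```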

Figures

Figures reproduced from arXiv: 2605.07985 by Anoop Rachakonda, Daehyeok Kim, Geon-Woo Kim, Joon Ha Kim.

Figure 1
Figure 1. Performance variation across configurations. TTFT and TPOT are reported for prefill and decode workloads. Red lines mark winner inversions; ties are within 0.5% of the best (FA: FlashAttention, FI: FlashInfer, TRI: TritonAttention). No single configuration performs optimally across the full H × S × M × W space. We demonstrate this empirically using two similar-sized models, Llama-3.1-8B [25] and Command-R7B… view at source ↗
Figure 2
Figure 2. Dooly architecture. Workflow… view at source ↗
Figure 3
Figure 3. DoolySim accuracy on four model architectures with vLLM and FlashInfer on an A100 GPU, using ShareGPT4 at 0.5 requests per second. End-to-end performance and scheduling. Dooly predicts vLLM’s measured latency with MAPE below 5% for TTFT (Figure 3a) and below 8% for TPOT (Figure 3b) across all reported percentiles. The higher TPOT error reflects per-token decode latency’s small absolute magnitude: sub-milli… view at source ↗
Figure 4
Figure 4. Per-batch latency predictions for three attention backends on A100 and H100, using Llama-3.1-8B. Configuration coverage. To validate that DoolySim expresses the configuration space of §2.1, we simulate the prefill-heavy workload on Llama-3.1-8B across three attention backends on both A100 (Figure 4a) and H100 (Figure 4b) GPUs. Dooly predicts TTFT within 0.5–8.0% error across all configurations, with the la… view at source ↗
Figure 5
Figure 5. Deduplication in Dooly amortizes profiling costs by 56%, sharing latency measurements across configurations. Model labels are defined in Appendix I. Excerpt of the accompanying table:
  Group      Variant        N   R   Profile (h)  Saved (h)  Red. (%)
  Attention  (aggregate)    42  27  10.86        14.24      56.7
             32/8/128       24  21  2.83         11.62      80.4
             28/4/128       6   3   2.24         2.24       50.0
             32/32/128      6   3   0.35         0.38       52.2
             window=4K      3   0   2.24         0.00       0.0
             window=32K     3   0   3.20         0.00       0.0
  Linear     (aten::linear…
view at source ↗
Figure 6
Figure 6. Bottom-up resolution process. Sufficiency of a single trace. A natural concern is whether a single dummy-prompt trace can cover both prefill and decode call paths. Phase-dependent branching is driven entirely by token-count fields in the attention metadata that the serving engine passes through the forward context, and only context-dependent modules consume this metadata; all other operations in the runnab… view at source ↗
Figure 7
Figure 7. Latency database schema. view at source ↗
Figure 8
Figure 8. The MAPE of TTFT and TPOT predictions are 2% and 5%, respectively. The simulated… view at source ↗
Figure 8
Figure 8. DoolySim accuracy on five model architectures with vLLM and FlashInfer on an H100 GPU, using ShareGPT4 at 1.0 requests per second. G Per-Batch Prediction Accuracy. Experiment Setup. Instead of streaming requests, we run a single batch with a fixed size and measure its latency. This isolates per-batch latency prediction accuracy from scheduling effects, which is critical for accurately modeling the serving e… view at source ↗
Figure 9
Figure 9. DoolySim accurately predicts the per-batch latency on Command-R7B with three attention backends on both A100 and H100 GPUs for a prefill-heavy workload. (Panels: FlashInfer, FlashAttention, TritonAttention; measured vs. predicted latency [s] over token counts 512–16K, with per-point error percentages.) view at source ↗
Figure 10
Figure 10. DoolySim accurately predicts the per-batch latency across different model and attention backend selections on A100 and H100 GPUs for a decode-heavy workload. view at source ↗
Figure 11
Figure 11. DoolySim correctly predicts the inversion points across different model and attention backend selections on an A100 GPU. view at source ↗
Figure 12
Figure 12. DoolySim correctly predicts the inversion points across different model and attention backend selections on an H100 GPU. DoolySim’s high per-batch accuracy enables it to correctly predict the inversion points across model and attention backend selections on both A100… view at source ↗
Figure 13
Figure 13. SGLang counterpart to… view at source ↗
read the original abstract

Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Dooly, a profiling system for LLM inference simulators that performs one inference pass with taint propagation to label each operation input dimension as model-configuration-fixed or request-dependent. Only absent operations are profiled; stateful operations (e.g., attention) reuse the serving engine's own initialization code. A latency database plus regression models are then used as a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and 12 models, Dooly reports simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while cutting profiling GPU-hours by 56.4% versus exhaustive per-configuration profiling.

Significance. If the taint-labeling and regression generalization hold, the work would meaningfully lower the cost of configuration exploration for LLM serving, a currently expensive step that limits both research and production tuning. The configuration-agnostic database and engine-reuse technique for stateful ops are pragmatic contributions that avoid per-model manual instrumentation.

major comments (3)
  1. [§3.2] §3.2 (Taint propagation): The central claim that a single inference pass with taint labeling suffices to separate reusable (config-fixed) from non-reusable (request-dependent) dimensions for every operation is load-bearing for both the 56.4% GPU-hour reduction and the reported MAPE. The manuscript does not demonstrate handling of derived or conditional dimensions (e.g., effective head size after model-specific reshape, or KV-cache sizes that mix config and runtime state). If any such case is mislabeled, the selective-profiling savings and regression predictions for unseen configurations would be invalidated.
  2. [§5] §5 (Evaluation and regression models): The 5% TTFT / 8% TPOT MAPE figures and the 56.4% reduction are presented without details on regression fitting procedure, feature set, data exclusion rules, train/test split, or outlier handling. This makes it impossible to assess whether the accuracy numbers are robust or result from post-hoc selection on the 12 evaluated models and three backends.
  3. [Table 2] Table 2 / results breakdown: The aggregate 56.4% GPU-hour saving is reported across 12 models, but no per-model or per-backend breakdown is given. Without it, it is unclear whether the savings are consistent or concentrated in a subset of architectures where many operations happen to be reusable.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from an early, explicit list of the 12 models, two platforms, and three backends to allow readers to judge the diversity claim.
  2. A small worked example (e.g., a 2-layer transformer) showing taint labels on a concrete operation would clarify the dimension-origin classification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Taint propagation): The central claim that a single inference pass with taint labeling suffices to separate reusable (config-fixed) from non-reusable (request-dependent) dimensions for every operation is load-bearing for both the 56.4% GPU-hour reduction and the reported MAPE. The manuscript does not demonstrate handling of derived or conditional dimensions (e.g., effective head size after model-specific reshape, or KV-cache sizes that mix config and runtime state). If any such case is mislabeled, the selective-profiling savings and regression predictions for unseen configurations would be invalidated.

    Authors: We agree that explicit demonstration of derived and conditional dimension handling is necessary to fully support the central claim. Dooly’s taint propagation tracks dimension origins through the full computation graph, including reshapes, concatenations, and arithmetic used for sizes such as KV-cache (where a dimension may combine a config-fixed head size with a request-dependent sequence length, receiving a mixed label). Operations with any request-dependent component are profiled. The current §3.2 description is high-level; we will revise it to add concrete examples and pseudocode for reshape-derived head sizes and mixed KV-cache calculations, showing how labels are unioned from source dimensions (see the sketch after this list). This will confirm that mislabeling does not occur for the evaluated models. revision: yes

  2. Referee: [§5] §5 (Evaluation and regression models): The 5% TTFT / 8% TPOT MAPE figures and the 56.4% reduction are presented without details on regression fitting procedure, feature set, data exclusion rules, train/test split, or outlier handling. This makes it impossible to assess whether the accuracy numbers are robust or result from post-hoc selection on the 12 evaluated models and three backends.

    Authors: We acknowledge that the absence of these details hinders independent assessment of robustness. The regression models use features including operation type, input/output shapes, hardware platform, and attention backend, fitted via standard regression techniques on profiled latencies. Data were collected across all 12 models and three backends with an 80/20 train/test split by configuration, outliers removed via IQR, and no post-hoc model selection. We will expand §5 with a new subsection providing the complete feature set, fitting procedure, split strategy, exclusion rules, and outlier handling (see the sketch after this list). This will allow readers to verify the reported MAPE values. revision: yes

  3. Referee: [Table 2] Table 2 / results breakdown: The aggregate 56.4% GPU-hour saving is reported across 12 models, but no per-model or per-backend breakdown is given. Without it, it is unclear whether the savings are consistent or concentrated in a subset of architectures where many operations happen to be reusable.

    Authors: We agree that a per-model and per-backend breakdown would better demonstrate consistency of the savings. While the aggregate reflects overall benefit across diverse architectures, granular data would clarify where redundancy is highest. We will add an extended table (or new supplementary table) reporting GPU-hour savings, percentage of reusable operations, and profiled operation counts for each of the 12 models and each of the three attention backends. revision: yes
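
On the first response: the label-unioning rule the authors describe can be sketched in a few lines. This is an editorial illustration of the idea, not the paper's propagation code; `Origin` and `union` are hypothetical names.

```python
# Editorial illustration of label unioning through shape arithmetic;
# not Dooly's implementation.
from enum import Enum

class Origin(Enum):
    CONFIG = "config"
    REQUEST = "request"
    MIXED = "mixed"

def union(a, b):
    """A derived dimension inherits the union of its source labels."""
    return a if a is b else Origin.MIXED

# Effective head size after reshaping config-fixed quantities stays config-fixed:
head_dim = union(Origin.CONFIG, Origin.CONFIG)   # e.g. hidden_size // num_heads
# A KV-cache extent mixing a config-fixed head size with a request-dependent
# sequence length becomes MIXED, so the consuming op is profiled, not reused as-is:
kv_extent = union(head_dim, Origin.REQUEST)      # e.g. head_dim * seq_len
assert kv_extent is Origin.MIXED
```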
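
On the second response: the fitting protocol the authors outline (features over operation type and shapes, an 80/20 split, IQR outlier removal) could look roughly like the sketch below. The regressor family is not stated, so RandomForestRegressor is an assumption, and the random split stands in for the by-configuration split they describe.

```python
# Editorial sketch of the described fitting protocol, not the authors' code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

def iqr_filter(X, y):
    """Drop samples whose latency lies outside 1.5 * IQR of the target."""
    q1, q3 = np.percentile(y, [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    keep = (y >= lo) & (y <= hi)
    return X[keep], y[keep]

def fit_latency_model(X, y, seed=0):
    """X: encoded features (op type, shapes, platform, backend); y: latency in seconds."""
    X, y = iqr_filter(np.asarray(X, dtype=float), np.asarray(y, dtype=float))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    held_out_mape = 100.0 * mean_absolute_percentage_error(y_te, model.predict(X_te))
    return model, held_out_mape
```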

Circularity Check

0 steps flagged

No circularity in derivation chain; method is empirically grounded.

full rationale

The paper describes an empirical workflow: a single inference pass with taint propagation to label dimension origins, selective profiling of operations absent from a latency database, isolation of stateful ops via the serving engine's own code, and construction of regression models from measured data. No equations, derivations, or predictions are shown that reduce by construction to fitted parameters or self-referential definitions. Accuracy figures (5% MAPE TTFT, 8% TPOT) and the 56.4% GPU-hour reduction are presented as direct comparisons against full profiling and actual inference runs on 12 models across platforms and backends. The central claims rest on external measurements rather than self-citation chains, uniqueness theorems, or ansatzes imported from prior work, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides limited detail; the central claim rests on the domain assumption that model-configuration values recur and that taint propagation correctly captures all dependencies.

axioms (1)
  • domain assumption: Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations.
    Explicitly stated as the structural understanding that enables reuse.

pith-pipeline@v0.9.0 · 5577 in / 1277 out tokens · 40287 ms · 2026-05-11T02:37:15.956351+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

  1. [1]

    URL https://developer.nvidia.com/cupti

    NVIDIA CUDA Profiling Tools Interface, 2025. URL https://developer.nvidia.com/cupti

  2. [2]

    URL https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

    PyTorch profiler documentation, Jul 2026. URL https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

  3. [3]

    Vidur: A large-scale simulation framework for llm inference, 2024

    Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for llm inference, 2024. URL https://arxiv.org/abs/2405.05465

  4. [4]

    Taming throughput-latency tradeoff in llm inference with sarathi-serve, 2024

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve, 2024. URL https://arxiv.org/abs/2403.02310

  5. [5]

    Revati: Transparent gpu-free time-warp emulation for llm serving,

    Amey Agrawal, Mayank Yadav, Sukrit Kumar, Anirudha Agrawal, Garv Ghai, Souradeep Bera, Elton Pinto, Sirish Gambhira, Mohammad Adain, Kasra Sohrab, Chus Antonanzas, and Alexey Tumanov. Revati: Transparent gpu-free time-warp emulation for llm serving, 2026. URL https://arxiv.org/abs/2601.00397

  6. [6]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245

  7. [7]

    vllm v1 performance optimization, 2026

    AMD. vllm v1 performance optimization, 2026. URL https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html

  8. [8]

    Flowdroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps

    Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. Flowdroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementatio...

  9. [9]

    Longformer: The long-document transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer,

  10. [10]

    URL https://arxiv.org/abs/2004.05150

  11. [11]

    Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B. Gibbons. Slos-serve: Optimized serving of multi-slo llms, 2025. URL https://arxiv.org/abs/2504.08784

  12. [12]

    Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale

    Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, and Jongse Park. Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale. In 2024 IEEE International Symposium on Workload Characterization (IISWC), page 15–29. IEEE, September 2024. doi: 10.1109/iiswc63097.2024.00012. URL http://dx.doi.org/10.1109/IISWC63097.2024.00012

  13. [13]

    Llmservingsim2.0: A unified simulator for heterogeneous hardware and serving techniques in llm infrastructure. IEEE Computer Architecture Letters, 24(2):361–364, July 2025

    Jaehong Cho, Hyunmin Choi, and Jongse Park. Llmservingsim2.0: A unified simulator for heterogeneous hardware and serving techniques in llm infrastructure. IEEE Computer Architecture Letters, 24(2):361–364, July 2025. ISSN 2473-2575. doi: 10.1109/lca.2025.3628325. URL http://dx.doi.org/10.1109/LCA.2025.3628325

  14. [14]

    Coherelabs/c4ai-command-r7b-12-2024

    Cohere. CohereLabs/c4ai-command-r7b-12-2024. https://huggingface.co/CohereLabs/c4ai-command-r7b-12-2024/

  15. [15]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022. URL https://arxiv.org/abs/2205.14135

  16. [16]

    Adapting vidur to vllm and profiling cpu overhead

    duanzhaol. Adapting vidur to vllm and profiling cpu overhead. https://github.com/microsoft/vidur/issues/51, 2025.

  17. [17]

    Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones

    William Enck, Peter Gilbert, Seungyeop Han, Vasant Tendulkar, Byung-Gon Chun, Landon P. Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N. Sheth. Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst., 32(2), June 2014. ISSN 0734-2071. doi: 10.1145/2619091. URL https://doi.org/10.1145/2619091

  18. [18]

    Frontier: Simulating the next generation of llm inference systems, 2025

    Yicheng Feng, Xin Tan, Kin Hang Sew, Yimin Jiang, Yibo Zhu, and Hong Xu. Frontier: Simulating the next generation of llm inference systems, 2025. URL https://arxiv.org/abs/2508.03148

  19. [19]

    How to get the profile csv of vllm instead of tensorrt-llm? https://github.com/casys-kaist/LLMServingSim/issues/10, 2025

    fwyc0573. How to get the profile csv of vllm instead of tensorrt-llm? https://github.com/casys-kaist/LLMServingSim/issues/10, 2025

  20. [20]

    Perfetto trace viewer

    Google. Perfetto trace viewer. URL https://perfetto.dev

  21. [21]

    Questions about simulation fidelity under vllm version differences, profiling utilization, and throughput estimation

    hariag. Questions about simulation fidelity under vllm version differences, profiling utilization, and throughput estimation. https://github.com/microsoft/apex_plus/issues/8, 2025

  22. [22]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  23. [23]

    Accelwattch: A power modeling framework for modern gpus

    Vijay Kandiah, Scott Peverelle, Mahmoud Khairy, Junrui Pan, Amogh Manjunath, Timothy G. Rogers, Tor M. Aamodt, and Nikos Hardavellas. Accelwattch: A power modeling framework for modern gpus. MICRO ’21, page 738–753, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385572. doi: 10.1145/3466752.3480063. URL https://doi.org/10.1145...

  24. [24]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  25. [25]

    Apex: An extensible and dynamism-aware simulator for automated parallel execution in llm serving,

    Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, and Fanny Nina Paravecino. Apex: An extensible and dynamism-aware simulator for automated parallel execution in llm serving, 2025. URL https://arxiv.org/abs/2411.17651

  26. [26]

    meta-llama/Llama-3.1-8B. https://huggingface.co/meta-llama/Llama-3.1-8B

    Meta. meta-llama/Llama-3.1-8B. https://huggingface.co/meta-llama/Llama-3.1-8B

  27. [27]

    Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software

    James Newsome and Dawn Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In 12th Annual Network and Distributed System Security Symposium (NDSS), San Diego, California, 2005

  28. [28]

    openchat_sharegpt4_dataset

    OpenChat. openchat_sharegpt4_dataset. https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset

  29. [29]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfo...

  30. [30]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting, 2024. URL https://arxiv.org/abs/2311.18677

  31. [31]

    GitHub - pytorch/kineto: A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters

    PyTorch. GitHub - pytorch/kineto: A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. https://github.com/pytorch/kineto

  32. [32]

    The anatomy of a triton attention kernel, 2025

    Burkhard Ringlein, Jan van Lunteren, Radu Stoica, and Thomas Parnell. The anatomy of a triton attention kernel, 2025. URL https://arxiv.org/abs/2511.11581

  33. [33]

    Running problems with replica scheduler orca/sarathi

    rxz 0420. Running problems with replica scheduler orca/sarathi. https://github.com/microsoft/vidur/issues/64, 2025

  34. [34]

    Adding new model

    samarth1612. Adding new model. https://github.com/microsoft/vidur/issues/32, 2024

  35. [35]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538

  36. [36]

    Triton. https://triton-lang.org/main/index.html, 2020

    Philippe Tillet. Triton. https://triton-lang.org/main/index.html, 2020

  37. [37]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

  38. [38]

    Optimization and tuning, 2026

    vLLM. Optimization and tuning, 2026. URL https://docs.vllm.ai/en/stable/configuration/optimization/#attention-backend-selection

  39. [39]

    Qwen2.5-1m technical report. arXiv preprint arXiv:2501.15383

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

  40. [40]

    Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005, 2025

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URL https://arxiv.org/abs/2501.01005

  41. [41]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024. URL https://arxiv.org/abs/2312.07104

  42. [42]

    Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving, 2024

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving, 2024. URL https://arxiv.org/abs/2401.09670

  43. [43]

    Error: missing context

  44. [44]

    Sufficiency of a single trace.A natural concern is whether a single dummy-prompt trace can cover both prefill and decode call paths

    (Figure 6 diagram labels: Absorb: children absorbed by parent; Import & Run; Success/Failure; Resolved; Fallback to parent; Runnable Set: Attention.) Figure 6: Bottom-up resolution process. Sufficiency of a single trace. A natural concern is whether a single dummy-prompt trace can cover both prefill and decode call paths. Phase-dependent branching is driven entirely by token-count fields in ...