LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
Pith reviewed 2026-05-09 18:40 UTC · model grok-4.3
The pith
LLM-Emu emulates vLLM serving by swapping GPU execution for profile-sampled latencies while keeping all real paths intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-Emu is a serving-native emulator that preserves vLLM's complete HTTP, scheduling, KV-cache, and output-processing paths while substituting GPU forward execution with latencies sampled from pre-collected static profiles and with synthetic output tokens. When evaluated on two GPUs, four model variants across two model families, two attention backends, and both Poisson and bursty ShareGPT workloads, the emulator reproduces real behavior with TPOT and ITL errors of at most 4.8 percent, E2E latency error within 5.3 percent, and output throughput error within 1.9 percent.
What carries the argument
Profile-driven sampling: GPU inference latencies are drawn from static hardware profiles in place of actual forward execution, while the rest of the serving engine runs unchanged.
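As a concrete illustration, here is a minimal sketch of what profile-driven sampling could look like; the `LatencyProfile` class, the (batch size, context tokens) profile key, and the batch representation are assumptions for illustration, not LLM-Emu's actual interfaces.

```python
# Hypothetical sketch: the scheduler forms a batch as usual, but instead of
# running the GPU forward pass we sample a step latency from a pre-collected
# profile and emit synthetic tokens.
import random
import time


class LatencyProfile:
    """Static latency profile keyed by (batch size, total context tokens)."""

    def __init__(self, samples: dict[tuple[int, int], list[float]]):
        # samples[(batch_size, context_tokens)] -> measured step latencies in seconds
        self.samples = samples
        self.keys = sorted(samples)

    def sample(self, batch_size: int, context_tokens: int) -> float:
        # Nearest profiled operating point, then a uniform draw from its
        # measured latencies.
        key = min(
            self.keys,
            key=lambda k: abs(k[0] - batch_size) + abs(k[1] - context_tokens),
        )
        return random.choice(self.samples[key])


def emulated_step(profile: LatencyProfile, batch: list) -> list[int]:
    """Stand-in for the GPU forward pass: wait the sampled latency, emit fake tokens."""
    context_tokens = sum(seq["num_tokens"] for seq in batch)  # hypothetical batch shape
    latency = profile.sample(len(batch), context_tokens)
    time.sleep(latency)  # occupies wall-clock time the way the real kernel would
    return [random.randrange(32_000) for _ in batch]  # one synthetic token id per sequence
```

Because only this step is swapped out, queueing, admission, batching, and KV-cache bookkeeping continue to run on the real code paths.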
If this is right
- Developers can test online serving behaviors including queueing and dynamic batching at far lower hardware cost than real GPU runs.
- Most latency and throughput metrics remain usable for evaluation even though actual GPU kernels are never executed.
- The same serving engine code can be used for both emulation and production, avoiding re-implementation of schedulers.
- Repeated experiments with different arrival patterns become feasible without repeated GPU allocation.
Where Pith is reading between the lines
- The same profiling technique could be adapted to other serving frameworks if their execution paths are similarly isolated.
- Faster iteration on scheduling policies becomes possible because each test no longer consumes full GPU resources.
- Emulation accuracy may degrade for metrics highly sensitive to exact queue state, such as time to first token, suggesting targeted profile refreshes for those paths.
Load-bearing premise
Latencies sampled from static profiles collected on the target hardware will continue to represent inference times accurately when the emulator runs under dynamic online conditions with varying batch sizes, queue depths, and KV-cache states.
What would settle it
Run LLM-Emu on a workload with arrival rates or model sizes outside the profiled set and observe whether TPOT or E2E latency error exceeds 5 percent.
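Such a check reduces to comparing per-metric statistics between an emulated run and a fresh real run. A minimal sketch, assuming each run is summarized as a dict of metric means and using relative error of means as a stand-in for the paper's error definition (metric names and numbers are illustrative):

```python
# Flag metrics whose emulated mean deviates from the real-run mean by more
# than a chosen bound (5% here, matching the thresholds quoted above).
def relative_errors(real: dict[str, float], emulated: dict[str, float]) -> dict[str, float]:
    return {m: abs(emulated[m] - real[m]) / real[m] for m in real}


def metrics_exceeding(real: dict[str, float], emulated: dict[str, float],
                      bound: float = 0.05) -> list[str]:
    return [m for m, err in relative_errors(real, emulated).items() if err > bound]


# Made-up numbers for a run at an arrival rate outside the profiled set:
real_run = {"tpot_ms": 41.2, "e2e_latency_s": 3.91}
emu_run = {"tpot_ms": 42.9, "e2e_latency_s": 4.07}
print(metrics_exceeding(real_run, emu_run))  # -> [] means all errors stayed within 5%
```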
Original abstract
Realistic evaluation of LLM serving systems requires online workloads, dynamic arrivals, queueing, and the serving engine's local scheduling for execution batching, but running such experiments on GPUs is expensive. Existing simulators reduce this cost, but often operate offline or in time-warped mode, re-implement serving-engine schedulers, or require accurate operator/kernel-level latency models. We present LLM-Emu, a serving-native emulator for vLLM that preserves the production HTTP, scheduling, KV-cache, and output-processing paths while replacing only GPU forward execution with profile-sampled latency and synthetic output tokens. Tested on two different GPUs, four model variants, two model families, two attention backends, and both Poisson and bursty ShareGPT workloads, LLM-Emu closely tracks real vLLM serving behavior: TPOT and ITL stay within $4.8\%$ absolute error, E2E latency within $5.3\%$, and output throughput within $1.9\%$; TTFT is less stable, with maximum error $10.4\%$, reflecting its sensitivity to admission and queue state. These results suggest that lightweight, serving-native emulation can support practical online experimentation for LLM-serving systems. LLM-Emu is open sourced at https://github.com/AKafakA/llm-emu.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LLM-Emu, a serving-native emulator for vLLM that preserves the production HTTP, scheduling, KV-cache management, and output-processing paths while replacing only GPU forward execution with latencies sampled from static hardware profiles plus synthetic output tokens. Extensive cross-validation on two GPUs, four model variants across two families, two attention backends, and both Poisson and bursty ShareGPT workloads shows that LLM-Emu tracks real vLLM behavior with absolute errors of at most 4.8% on TPOT and ITL, 5.3% on end-to-end latency, and 1.9% on output throughput; TTFT exhibits higher variability (maximum 10.4% error) explicitly attributed to admission and queue-state sensitivity.
Significance. If the accuracy holds, LLM-Emu offers a lightweight, low-cost alternative to full GPU runs for online LLM-serving experiments involving dynamic arrivals and scheduling, which is a practical contribution to the distributed systems and ML-systems communities. The multi-GPU, multi-model, multi-backend validation and open-source release (https://github.com/AKafakA/llm-emu) are concrete strengths that increase the work's immediate utility.
major comments (1)
- [Abstract / Evaluation] Abstract and evaluation results: the maximum 10.4% TTFT error is noted as arising from queue sensitivity, yet the central claim that static profile-sampled latencies suffice for dynamic online conditions (varying batch sizes, queue depths, KV-cache states) rests on the assumption that profile collection covered the relevant operating regimes. The manuscript should specify the exact batch/sequence-length configurations used during profiling and add targeted experiments that isolate queue-depth or arrival-rate variations to confirm that unmodeled state-dependent effects do not exceed the quoted 4.8% bound on TPOT/ITL.
minor comments (1)
- [Abstract] The abstract states results for 'two different GPUs, four model variants, two model families' but does not list the precise model names or GPU SKUs in the summary paragraph; adding them would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We will clarify the profiling methodology and strengthen the validation of dynamic conditions as requested.
Point-by-point responses
- Referee: [Abstract / Evaluation] Abstract and evaluation results: the maximum 10.4% TTFT error is noted as arising from queue sensitivity, yet the central claim that static profile-sampled latencies suffice for dynamic online conditions (varying batch sizes, queue depths, KV-cache states) rests on the assumption that profile collection covered the relevant operating regimes. The manuscript should specify the exact batch/sequence-length configurations used during profiling and add targeted experiments that isolate queue-depth or arrival-rate variations to confirm that unmodeled state-dependent effects do not exceed the quoted 4.8% bound on TPOT/ITL.
Authors: We agree that greater transparency on profiling coverage will strengthen the paper. In the revision we will add a dedicated subsection (and accompanying table) that lists the exact batch sizes (1–256) and sequence-length ranges (up to 4096 tokens) used to collect the static latency profiles; these ranges were selected to bracket the batch compositions and context lengths observed during the online runs. Our existing evaluation already exercises two arrival processes (Poisson and bursty ShareGPT) that produce statistically different queue-depth distributions, KV-cache occupancy patterns, and batch-size variability. To isolate the requested factors, we will include an additional ablation that sweeps arrival rate while logging instantaneous queue depth and reports the resulting TPOT/ITL errors; preliminary internal checks show the errors remain below the 4.8% bound. These changes will be reflected in both the abstract and the evaluation section.
revision: yes
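For reference, a hedged sketch of the kind of profiling sweep described above, covering batch sizes 1–256 and context lengths up to 4096 tokens; the grid spacing and the `run_forward_pass` hook are placeholders, not the authors' actual profiling script.

```python
# Hypothetical profiling sweep: time forward passes on the target GPU over a
# grid of batch sizes and context lengths, keeping repeated measurements per
# operating point so the emulator can later sample from them.
import json
import time

BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]
CONTEXT_LENS = [128, 256, 512, 1024, 2048, 4096]
REPEATS = 20


def run_forward_pass(batch_size: int, context_len: int) -> None:
    """Placeholder: execute one real forward pass with the target model/runtime."""
    raise NotImplementedError("wire this to the actual engine before profiling")


def collect_profile(path: str = "latency_profile.json") -> None:
    profile: dict[str, list[float]] = {}
    for bs in BATCH_SIZES:
        for ctx in CONTEXT_LENS:
            key = f"{bs}x{ctx}"
            timings = []
            for _ in range(REPEATS):
                start = time.perf_counter()
                run_forward_pass(bs, ctx)
                timings.append(time.perf_counter() - start)
            profile[key] = timings
    with open(path, "w") as f:
        json.dump(profile, f)
```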
Circularity Check
No circularity: profiles are independent measurements; validation uses fresh real executions
full rationale
The paper collects static latency profiles from separate real hardware runs on target GPUs, then samples from those profiles to replace only the GPU forward pass inside an otherwise unmodified vLLM serving stack. Central performance claims (TPOT/ITL within 4.8%, E2E latency within 5.3%, throughput within 1.9%) are established by direct side-by-side comparison against fresh, independent vLLM executions on the same Poisson and ShareGPT workloads. No equations, fitted parameters, or self-citations are shown that would make any reported result equivalent to its own inputs by construction. The method is therefore externally benchmarked rather than self-referential.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: latencies collected from isolated profiling runs on target hardware can be sampled to faithfully represent forward-pass times under live batching and queuing.
Reference graph
Works this paper leans on
- [1] A. Agrawal, N. Kedia, J. Mohan, A. Panwar, N. Kwatra, B. S. Gulavani, R. Ramjee, and A. Tumanov, "Vidur: A large-scale simulation framework for LLM inference," in Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa, Eds., vol. 6, 2024, pp. 351–366. [Online]. Available: https://proceedings.mlsys.org/paper_files/paper/2024/...
- [2] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee, "Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve," in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI '24), USENIX Association, 2024.
- [3] A. Agrawal, M. Yadav, S. Kumar, A. Agrawal, G. Ghai, S. Bera, E. Pinto, S. Gambhira, M. Adain, K. Sohrab, C. Antonanzas, and A. Tumanov, "Revati: Transparent GPU-free time-warp emulation for LLM serving," 2026. [Online]. Available: https://arxiv.org/abs/2601.00397
- [4] J. Cho, H. Choi, and J. Park, "LLMServingSim2.0: A unified simulator for heterogeneous hardware and serving techniques in LLM infrastructure," IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 361–364, 2025. [Online]. Available: http://dx.doi.org/10.1109/LCA.2025.3628325
- [5] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), Curran Associates Inc., 2022.
- [6] Y. Feng, X. Tan, K. H. Sew, Y. Jiang, Y. Zhu, and H. Xu, "Frontier: Simulating the next generation of LLM inference systems," in Proceedings of the 4th Workshop on Practical Adoption Challenges of ML for Systems, 2025, pp. 25–30.
- [7] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), Association for Computing Machinery, 2023, pp. 611–626.
- [8] Y.-C. Lin, W. Kwon, R. Pineda, and F. N. Paravecino, "APEX: An extensible and dynamism-aware simulator for automated parallel execution in LLM serving," 2024. [Online]. Available: https://arxiv.org/abs/2411.17651
- [9] B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y. Li, and W. Lin, "Llumnix: Dynamic scheduling for large language model serving," in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI '24), USENIX Association, 2024.
- [10] T. Xu, Y. Liu, X. Lu, Y. Zhao, X. Zhou, A. Feng, Y. Chen, Y. Shen, Q. Zhou, X. Chen, I. Sherstyuk, H. Li, R. Thakkar, B. Hamm, Y. Li, X. Huang, W. Wu, A. Shanbhag, H. Kim, C. Chen, and J. Lai, "Aiconfigurator: Lightning-fast configuration optimization for multi-framework LLM serving," 2026. [Online]. Available: https://arxiv.org/abs/2601.06288
- [11] Z. Ye, L. Chen, R. Lai, W. Lin, Y. Zhang, S. Wang, T. Chen, B. Kasikci, V. Grover, A. Krishnamurthy, and L. Ceze, "FlashInfer: Efficient and customizable attention engine for LLM inference serving," arXiv preprint arXiv:2501.01005, 2025. [Online]. Available: https://arxiv.org/abs/2501.01005
- [12] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, "Orca: A distributed serving system for Transformer-based generative models," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu
- [13] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, "DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving," in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI '24), USENIX Association, 2024.