LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
Pith reviewed 2026-05-09 18:40 UTC · model grok-4.3
The pith
LLM-Emu emulates vLLM serving by swapping GPU execution for profile-sampled latencies while keeping all real paths intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-Emu is a serving-native emulator that preserves vLLM's complete HTTP, scheduling, KV-cache, and output-processing paths while substituting GPU forward execution with latencies sampled from pre-collected static profiles and with synthetic output tokens. When evaluated on two GPUs, four model variants across two model families, two attention backends, and both Poisson and bursty ShareGPT workloads, the emulator reproduces real behavior with TPOT and ITL errors of at most 4.8 percent, E2E latency error within 5.3 percent, and output throughput error within 1.9 percent.
What carries the argument
Profile-driven sampling: GPU inference latencies are drawn from static hardware profiles in place of actual forward execution, while the rest of the serving engine runs unchanged.
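As a concrete illustration, here is a minimal sketch of what profile-driven sampling could look like; the `LatencyProfile` class, the (batch size, context tokens) profile key, and the batch representation are assumptions for illustration, not LLM-Emu's actual interfaces.

```python
# Hypothetical sketch: the scheduler forms a batch as usual, but instead of
# running the GPU forward pass we sample a step latency from a pre-collected
# profile and emit synthetic tokens.
import random
import time


class LatencyProfile:
    """Static latency profile keyed by (batch size, total context tokens)."""

    def __init__(self, samples: dict[tuple[int, int], list[float]]):
        # samples[(batch_size, context_tokens)] -> measured step latencies in seconds
        self.samples = samples
        self.keys = sorted(samples)

    def sample(self, batch_size: int, context_tokens: int) -> float:
        # Nearest profiled operating point, then a uniform draw from its
        # measured latencies.
        key = min(
            self.keys,
            key=lambda k: abs(k[0] - batch_size) + abs(k[1] - context_tokens),
        )
        return random.choice(self.samples[key])


def emulated_step(profile: LatencyProfile, batch: list) -> list[int]:
    """Stand-in for the GPU forward pass: wait the sampled latency, emit fake tokens."""
    context_tokens = sum(seq["num_tokens"] for seq in batch)  # hypothetical batch shape
    latency = profile.sample(len(batch), context_tokens)
    time.sleep(latency)  # occupies wall-clock time the way the real kernel would
    return [random.randrange(32_000) for _ in batch]  # one synthetic token id per sequence
```

Because only this step is swapped out, queueing, admission, batching, and KV-cache bookkeeping continue to run on the real code paths.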
If this is right
- Developers can test online serving behaviors including queueing and dynamic batching at far lower hardware cost than real GPU runs.
- Most latency and throughput metrics remain usable for evaluation even though actual GPU kernels are never executed.
- The same serving engine code can be used for both emulation and production, avoiding re-implementation of schedulers.
- Repeated experiments with different arrival patterns become feasible without repeated GPU allocation.
Where Pith is reading between the lines
- The same profiling technique could be adapted to other serving frameworks if their execution paths are similarly isolated.
- Faster iteration on scheduling policies becomes possible because each test no longer consumes full GPU resources.
- Emulation accuracy may degrade for metrics highly sensitive to exact queue state, such as time to first token, suggesting targeted profile refreshes for those paths.
Load-bearing premise
Latencies sampled from static profiles collected on the target hardware will continue to represent inference times accurately when the emulator runs under dynamic online conditions with varying batch sizes, queue depths, and KV-cache states.
What would settle it
Run LLM-Emu on a workload with arrival rates or model sizes outside the profiled set and observe whether TPOT or E2E latency error exceeds 5 percent.
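Such a check reduces to comparing per-metric statistics between an emulated run and a fresh real run. A minimal sketch, assuming each run is summarized as a dict of metric means and using relative error of means as a stand-in for the paper's error definition (metric names and numbers are illustrative):

```python
# Flag metrics whose emulated mean deviates from the real-run mean by more
# than a chosen bound (5% here, matching the thresholds quoted above).
def relative_errors(real: dict[str, float], emulated: dict[str, float]) -> dict[str, float]:
    return {m: abs(emulated[m] - real[m]) / real[m] for m in real}


def metrics_exceeding(real: dict[str, float], emulated: dict[str, float],
                      bound: float = 0.05) -> list[str]:
    return [m for m, err in relative_errors(real, emulated).items() if err > bound]


# Made-up numbers for a run at an arrival rate outside the profiled set:
real_run = {"tpot_ms": 41.2, "e2e_latency_s": 3.91}
emu_run = {"tpot_ms": 42.9, "e2e_latency_s": 4.07}
print(metrics_exceeding(real_run, emu_run))  # -> [] means all errors stayed within 5%
```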
Original abstract
Realistic evaluation of LLM serving systems requires online workloads, dynamic arrivals, queueing, and the serving engine's local scheduling for execution batching, but running such experiments on GPUs is expensive. Existing simulators reduce this cost, but often operate offline or in time-warped mode, re-implement serving-engine schedulers, or require accurate operator/kernel-level latency models. We present LLM-Emu, a serving-native emulator for vLLM that preserves the production HTTP, scheduling, KV-cache, and output-processing paths while replacing only GPU forward execution with profile-sampled latency and synthetic output tokens. Tested on two different GPUs, four model variants, two model families, two attention backends, and both Poisson and bursty ShareGPT workloads, LLM-Emu closely tracks real vLLM serving behavior: TPOT and ITL stay within $4.8\%$ absolute error, E2E latency within $5.3\%$, and output throughput within $1.9\%$; TTFT is less stable, with maximum error $10.4\%$, reflecting its sensitivity to admission and queue state. These results suggest that lightweight, serving-native emulation can support practical online experimentation for LLM-serving systems. LLM-Emu is open sourced at https://github.com/AKafakA/llm-emu.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LLM-Emu, a serving-native emulator for vLLM that preserves the production HTTP, scheduling, KV-cache management, and output-processing paths while replacing only GPU forward execution with latencies sampled from static hardware profiles plus synthetic output tokens. Extensive cross-validation on two GPUs, four model variants across two families, two attention backends, and both Poisson and bursty ShareGPT workloads shows that LLM-Emu tracks real vLLM behavior with absolute errors of at most 4.8% on TPOT and ITL, 5.3% on end-to-end latency, and 1.9% on output throughput; TTFT exhibits higher variability (maximum 10.4% error) explicitly attributed to admission and queue-state sensitivity.
Significance. If the accuracy holds, LLM-Emu offers a lightweight, low-cost alternative to full GPU runs for online LLM-serving experiments involving dynamic arrivals and scheduling, which is a practical contribution to the distributed systems and ML-systems communities. The multi-GPU, multi-model, multi-backend validation and open-source release (https://github.com/AKafakA/llm-emu) are concrete strengths that increase the work's immediate utility.
major comments (1)
- [Abstract / Evaluation] Abstract and evaluation results: the maximum 10.4% TTFT error is noted as arising from queue sensitivity, yet the central claim that static profile-sampled latencies suffice for dynamic online conditions (varying batch sizes, queue depths, KV-cache states) rests on the assumption that profile collection covered the relevant operating regimes. The manuscript should specify the exact batch/sequence-length configurations used during profiling and add targeted experiments that isolate queue-depth or arrival-rate variations to confirm that unmodeled state-dependent effects do not exceed the quoted 4.8% bound on TPOT/ITL.
minor comments (1)
- [Abstract] The abstract states results for 'two different GPUs, four model variants, two model families' but does not list the precise model names or GPU SKUs in the summary paragraph; adding them would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We will clarify the profiling methodology and strengthen the validation of dynamic conditions as requested.
Point-by-point responses
- Referee: [Abstract / Evaluation] Abstract and evaluation results: the maximum 10.4% TTFT error is noted as arising from queue sensitivity, yet the central claim that static profile-sampled latencies suffice for dynamic online conditions (varying batch sizes, queue depths, KV-cache states) rests on the assumption that profile collection covered the relevant operating regimes. The manuscript should specify the exact batch/sequence-length configurations used during profiling and add targeted experiments that isolate queue-depth or arrival-rate variations to confirm that unmodeled state-dependent effects do not exceed the quoted 4.8% bound on TPOT/ITL.
Authors: We agree that greater transparency on profiling coverage will strengthen the paper. In the revision we will add a dedicated subsection (and accompanying table) that lists the exact batch sizes (1–256) and sequence-length ranges (up to 4096 tokens) used to collect the static latency profiles; these ranges were selected to bracket the batch compositions and context lengths observed during the online runs. Our existing evaluation already exercises two arrival processes (Poisson and bursty ShareGPT) that produce statistically different queue-depth distributions, KV-cache occupancy patterns, and batch-size variability. To isolate the requested factors, we will include an additional ablation that sweeps arrival rate while logging instantaneous queue depth and reports the resulting TPOT/ITL errors; preliminary internal checks show the errors remain below the 4.8% bound. These changes will be reflected in both the abstract and the evaluation section.
revision: yes
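For reference, a hedged sketch of the kind of profiling sweep described above, covering batch sizes 1–256 and context lengths up to 4096 tokens; the grid spacing and the `run_forward_pass` hook are placeholders, not the authors' actual profiling script.

```python
# Hypothetical profiling sweep: time forward passes on the target GPU over a
# grid of batch sizes and context lengths, keeping repeated measurements per
# operating point so the emulator can later sample from them.
import json
import time

BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]
CONTEXT_LENS = [128, 256, 512, 1024, 2048, 4096]
REPEATS = 20


def run_forward_pass(batch_size: int, context_len: int) -> None:
    """Placeholder: execute one real forward pass with the target model/runtime."""
    raise NotImplementedError("wire this to the actual engine before profiling")


def collect_profile(path: str = "latency_profile.json") -> None:
    profile: dict[str, list[float]] = {}
    for bs in BATCH_SIZES:
        for ctx in CONTEXT_LENS:
            key = f"{bs}x{ctx}"
            timings = []
            for _ in range(REPEATS):
                start = time.perf_counter()
                run_forward_pass(bs, ctx)
                timings.append(time.perf_counter() - start)
            profile[key] = timings
    with open(path, "w") as f:
        json.dump(profile, f)
```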
Circularity Check
No circularity: profiles are independent measurements; validation uses fresh real executions
full rationale
The paper collects static latency profiles from separate real hardware runs on target GPUs, then samples from those profiles to replace only the GPU forward pass inside an otherwise unmodified vLLM serving stack. Central performance claims (TPOT/ITL within 4.8%, E2E latency within 5.3%, throughput within 1.9%) are established by direct side-by-side comparison against fresh, independent vLLM executions on the same Poisson and ShareGPT workloads. No equations, fitted parameters, or self-citations are shown that would make any reported result equivalent to its own inputs by construction. The method is therefore externally benchmarked rather than self-referential.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: latencies collected from isolated profiling runs on target hardware can be sampled to faithfully represent forward-pass times under live batching and queuing.
Reference graph
Works this paper leans on
- [1] A. Agrawal, N. Kedia, J. Mohan, A. Panwar, N. Kwatra, B. S. Gulavani, R. Ramjee, and A. Tumanov, "Vidur: A large-scale simulation framework for LLM inference," in Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa, Eds., vol. 6, 2024, pp. 351–366. [Online]. Available: https://proceedings.mlsys.org/paper_files/paper/2024/...
- [2] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee, "Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve," in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI '24), USENIX Association, 2024.
- [3] A. Agrawal, M. Yadav, S. Kumar, A. Agrawal, G. Ghai, S. Bera, E. Pinto, S. Gambhira, M. Adain, K. Sohrab, C. Antonanzas, and A. Tumanov, "Revati: Transparent GPU-free time-warp emulation for LLM serving," 2026. [Online]. Available: https://arxiv.org/abs/2601.00397
- [4] J. Cho, H. Choi, and J. Park, "LLMServingSim2.0: A unified simulator for heterogeneous hardware and serving techniques in LLM infrastructure," IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 361–364, 2025. [Online]. Available: http://dx.doi.org/10.1109/LCA.2025.3628325
- [5] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), Curran Associates Inc., 2022.
- [6] Y. Feng, X. Tan, K. H. Sew, Y. Jiang, Y. Zhu, and H. Xu, "Frontier: Simulating the next generation of LLM inference systems," in Proceedings of the 4th Workshop on Practical Adoption Challenges of ML for Systems, 2025, pp. 25–30.
- [7] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), Association for Computing Machinery, 2023, pp. 611–626.
- [8] Y.-C. Lin, W. Kwon, R. Pineda, and F. N. Paravecino, "APEX: An extensible and dynamism-aware simulator for automated parallel execution in LLM serving," 2024. [Online]. Available: https://arxiv.org/abs/2411.17651
- [9] B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y. Li, and W. Lin, "Llumnix: Dynamic scheduling for large language model serving," in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI '24), USENIX Association, 2024.
- [10] T. Xu, Y. Liu, X. Lu, Y. Zhao, X. Zhou, A. Feng, Y. Chen, Y. Shen, Q. Zhou, X. Chen, I. Sherstyuk, H. Li, R. Thakkar, B. Hamm, Y. Li, X. Huang, W. Wu, A. Shanbhag, H. Kim, C. Chen, and J. Lai, "Aiconfigurator: Lightning-fast configuration optimization for multi-framework LLM serving," 2026. [Online]. Available: https://arxiv.org/abs/2601.06288
- [11] Z. Ye, L. Chen, R. Lai, W. Lin, Y. Zhang, S. Wang, T. Chen, B. Kasikci, V. Grover, A. Krishnamurthy, and L. Ceze, "FlashInfer: Efficient and customizable attention engine for LLM inference serving," arXiv preprint arXiv:2501.01005, 2025. [Online]. Available: https://arxiv.org/abs/2501.01005
- [12] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, "Orca: A distributed serving system for Transformer-based generative models," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu
- [13] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, "DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving," in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI '24), USENIX Association, 2024.