
arxiv: 2606.29094 · v1 · pith:PBKD3KW4new · submitted 2026-06-27 · 💻 cs.LG

DiLaServe: High SLO Attainment Serving for Diffusion Language Models

Pith reviewed 2026-06-30 09:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords diffusion language models · serving systems · SLO attainment · KV caching · confidence-based denoising · cluster reconfiguration · deadline-aware scheduling

The pith

DiLaServe achieves up to 56.6 percentage points higher SLO attainment for diffusion language models by deadline-aware scheduling and quality-aware cluster reconfiguration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiLaServe as a cluster-level serving system built for diffusion language models, which generate multiple tokens in parallel per denoising step. It targets three production challenges: the speed-quality tradeoff from confidence-based denoising, fluctuating load that requires adaptive parallelization, and non-uniform per-step costs created by approximate KV caching. The system combines deadline-aware scheduling with confidence-threshold adjustment and solves a quality-aware optimization problem to reconfigure the cluster while modeling those heterogeneous step costs. If the approach works, DLMs could deliver their throughput advantage in real serving environments without frequent SLO violations and with negligible quality loss.

Core claim

DiLaServe is a cluster-level serving system for DLMs that enables deadline-aware scheduling and adaptive load control through confidence-threshold adjustment, and dynamically reconfigures the cluster by solving a quality-aware optimization problem while explicitly modeling the step-level heterogeneity introduced by approximate KV caching. Across multiple benchmarks and real-world traces, this yields up to 56.6 percentage points better SLO attainment and up to 46% lower end-to-end request latency while keeping accuracy drop below 1%.
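
As a rough illustration of the deadline-aware adjustment described above, the following Python sketch picks the highest confidence threshold whose predicted remaining time still fits a request's deadline. The `Request` fields, the `pick_threshold` helper, and the prediction values are hypothetical and are not taken from the paper; the sketch assumes only that lower thresholds finish sooner at some cost in accuracy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    elapsed_s: float   # time already spent since arrival (hypothetical field)
    deadline_s: float  # SLO deadline measured from arrival (hypothetical field)


def pick_threshold(req, predicted_remaining_s):
    """Return the highest threshold predicted to meet the deadline.

    predicted_remaining_s maps a confidence threshold to the predicted
    remaining seconds under that threshold (assumed predictor output).
    Falls back to the lowest threshold when none is predicted to fit.
    """
    budget = req.deadline_s - req.elapsed_s
    feasible = [t for t, s in predicted_remaining_s.items() if s <= budget]
    return max(feasible) if feasible else min(predicted_remaining_s)


req = Request(elapsed_s=2.0, deadline_s=6.0)
predictions = {0.9: 5.0, 0.8: 3.5, 0.7: 2.5}  # illustrative values only
chosen = pick_threshold(req, predictions)
print(chosen)
```

Preferring the highest feasible threshold reflects the stated goal of meeting the deadline with as little quality loss as possible; a real scheduler would also need to account for queueing and cluster load.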

What carries the argument

The quality-aware optimization problem that dynamically reconfigures the cluster while modeling step-level heterogeneity from approximate KV caching.

If this is right

  • Deadline-aware scheduling paired with confidence-threshold adjustment allows DLMs to meet latency targets while preserving output quality.
  • Accounting for non-uniform per-step costs from approximate KV caching improves decisions about parallelization levels under changing load.
  • The resulting system delivers measurable reductions in end-to-end latency alongside higher SLO attainment across evaluated traces.
  • Accuracy remains within 1% of baseline on the tested benchmarks when the optimization is applied.
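
The gains are stated in percentage points, meaning an absolute difference between two attainment percentages rather than a relative improvement. A minimal sketch, assuming SLO attainment is the share of requests that finish within their deadline (the paper's exact definition is not given on this page); the function names and sample latencies are illustrative, while 91.13% and 60.98% are the DiLaServe and Llumnix attainment values quoted with Figure 8.

```python
def slo_attainment(latencies_s, deadlines_s):
    """Percentage of requests whose latency is within the deadline."""
    met = sum(1 for lat, dl in zip(latencies_s, deadlines_s) if lat <= dl)
    return 100.0 * met / len(latencies_s)


def pp_gain(system_pct, baseline_pct):
    """Absolute difference between two percentages, in percentage points."""
    return system_pct - baseline_pct


# Made-up latencies and deadlines: three of four requests meet the deadline.
latencies = [1.0, 2.5, 4.0, 0.8]
deadlines = [2.0, 2.0, 5.0, 1.0]
attain = slo_attainment(latencies, deadlines)

# Attainment values quoted with Figure 8 (DiLaServe vs. Llumnix).
gain_vs_llumnix = pp_gain(91.13, 60.98)
print(attain, round(gain_vs_llumnix, 2))
```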

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar deadline and cost-modeling techniques could transfer to other parallel token-generation architectures that exhibit variable per-step work.
  • If KV-cache approximations become more common in production, explicit heterogeneity modeling may become a standard component of serving schedulers.
  • Extending the optimization to include energy or memory constraints would be a direct next step given the current formulation.

Load-bearing premise

The modeling of step-level heterogeneity from approximate KV caching and the quality-aware optimization problem accurately reflect production dynamics, and the chosen benchmarks plus traces are representative of real serving workloads.
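
To see why step-level heterogeneity matters for this premise, consider a toy cost model in which the approximate KV cache is refreshed every few denoising steps and refresh steps cost more than cached ones. The periodic-refresh pattern, the function name, and all timings below are assumptions for illustration, not the paper's model.

```python
def request_latency(num_steps, refresh_interval, refresh_cost_s, cached_cost_s):
    """Sum per-step costs when every refresh_interval-th step refreshes the cache."""
    total = 0.0
    for step in range(num_steps):
        is_refresh = step % refresh_interval == 0  # assumed refresh schedule
        total += refresh_cost_s if is_refresh else cached_cost_s
    return total


# Illustrative numbers: 8 steps, refresh on steps 0 and 4.
heterogeneous = request_latency(
    num_steps=8, refresh_interval=4, refresh_cost_s=0.05, cached_cost_s=0.01
)
# A uniform estimate that prices every step as a cached step.
uniform_guess = 8 * 0.01
print(heterogeneous, uniform_guess)
```

In this toy setting the uniform estimate is half the heterogeneous total, which is the kind of gap that could mislead a scheduler or optimizer that ignores non-uniform step costs.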

What would settle it

Running DiLaServe on a fresh real-world trace or benchmark workload and measuring whether the reported gains in SLO attainment and latency reduction still appear.

Figures

Figures reproduced from arXiv: 2606.29094 by Benjamin Yuanyang Hong, Kiet Pham, Shivaram Venkataraman, Tzu-Tao Chang.

Figure 1: Accuracy and throughput comparison of LLaMA [50] (autoregressive) and LLaDA [36] (diffusion) on GSM8K [10], run on a single H100 GPU. Numbers in parentheses denote the denoising confidence thresholds.
… constrained by the autoregressive nature of these models: tokens must be generated sequentially, one at a time. This inherent serialization limits parallelism and makes it difficult to scale throughput to m…
Figure 2: Diffusion language models.
… 56.6 percentage points with only a 0.9% accuracy drop under high load.
2 Background and Challenges
2.1 Diffusion Language Models
Diffusion has emerged as a promising alternative paradigm to autoregressive language modeling [6, 32, 43]. Rather than generating tokens one-by-one in a left-to-right order, DLMs initialize generation from a corrupted sequence and progressively refine …
Figure 4: Tensor parallelism for DLMs.
… both the achieved accuracy and the number of denoising steps decrease accordingly. This trade-off creates opportunities for improved serving performance, both for meeting individual request SLOs and for system-wide load control. At the request level, this knob enables deadline-aware adaptation: if a request is predicted to miss its SLO under the current threshold, the server ca…
Figure 5: DiLaServe architecture.
… caching while also coordinating it with dynamic confidence-threshold control and cluster reconfiguration.
3 DiLaServe
DiLaServe is a cluster-level serving system for DLMs that achieves high SLO attainment by dynamically adjusting the denoising-step confidence threshold and using an ILP-based reconfiguration framework tailored to the characteristics of DLMs. In this section, we first…
Figure 6: Serving performance of LLaDA on GSM8K (left) and step prediction error (right) under different step prediction strategies.
… threshold. If this worst-case load exceeds cluster capacity, the algorithm removes this threshold from the request’s allowed set and continues to the next lower threshold; otherwise, it keeps this threshold and moves on to the next request. By doing so, the algorithm finds the widest …
Figure 7: Step prediction error and execution overhead at different prediction granularities. Here, execution overhead is measured as the time spent generating predictions using the predictor relative to the execution time of a denoising step. The red dashed line marks a 5% relative error upper bound over the 1-token-granularity.
We use these features to train a Step Predictor that estimates the remaining work of a…
Figure 8: RPS of the trace, DiLaServe’s cluster configuration and confidence threshold, and SLO attainment over time for all three systems on the real-world trace.
System | SLO Attain. | Score | Avg. Latency (s), Overall | Avg. Latency (s), Low-load | Avg. Latency (s), High-load
DiLaServe | 91.13% | 6.80 | 5.75 | 2.26 | 6.33
Llumnix | 60.98% | 6.89 | 10.64 | 3.23 | 11.30
INFaaS | 70.03% | 6.89 | 8.80 | 2.71 | 9.34
Figure 9: Experimental results for serving LLaDA (top) and Dream (bottom) on the three accuracy benchmarks.
[Plot residue: x-axis RPS per GPU; y-axes SLO Attain. (%) and Accuracy (%); series Fixed thresh., Dynamic thresh., Dynamic thresh. + Load control]
Figure 10: Confidence threshold control ablation serving LLaDA with GSM8K.
… SLO attainment as the SLO becomes stricter. In the moderately tight SLO regime (≥ 3×), DiLaServe achieves up to 33.9 percentage points higher SLO attainment with only a 0.1% accuracy drop compared to INFaaS, and 42.3 percentage points higher SLO attainment with only a 0.1% accuracy drop compared to Llumnix. Under extremely tight SLOs, DiLa…
Figure 12: Input and output length ablation. For the input length ablation (left two columns), output length is fixed to 256. For the output length ablation (right two columns), number of few-shot examples is fixed to 5.
… rate in the underlying trace.
Input and Output Length
Figure 13: Migration effectiveness. Enabling denoising-step-level migration reduces P99 latency. Input length is set to 256 for all cases.
# Model Instances | 4 | 8 | 12 | 16
Normalized latency (vs. recompute-only) | 1.17 | 1.28 | 1.33 | 1.38
Original abstract

Diffusion language models (DLMs) have recently emerged as a promising alternative to conventional autoregressive language models. By generating multiple tokens in parallel during each denoising step, they offer higher inference throughput while maintaining competitive quality. However, realizing these throughput gains while meeting latency SLOs in a serving system requires addressing challenges introduced by DLMs' unique characteristics. These include navigating the speed-quality tradeoff created by confidence-based denoising, choosing appropriate parallelization levels across model instances under fluctuating load, and coordinating approximate KV caching mechanisms that introduce non-uniform per-step costs. To address these challenges, we present DiLaServe, a cluster-level serving system for DLMs. DiLaServe enables deadline-aware scheduling and adaptive load control through confidence-threshold adjustment, and dynamically reconfigures the cluster by solving a quality-aware optimization problem, while explicitly modeling the step-level heterogeneity introduced by approximate KV caching. Across multiple benchmarks and real-world traces, DiLaServe improves SLO attainment by up to 56.6 percentage points and reduces end-to-end request latency by up to 46% while incurring less than 1% accuracy drop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents DiLaServe, a cluster-level serving system for diffusion language models (DLMs). DLMs generate multiple tokens in parallel per denoising step but introduce speed-quality tradeoffs via confidence-based denoising, variable parallelization needs under load, and non-uniform per-step costs from approximate KV caching. DiLaServe addresses these via deadline-aware scheduling, adaptive load control through confidence-threshold adjustment, and dynamic cluster reconfiguration by solving a quality-aware optimization problem that explicitly models the step-level heterogeneity. Across benchmarks and real-world traces, it reports up to 56.6 percentage points higher SLO attainment, up to 46% lower end-to-end latency, and <1% accuracy drop.

Significance. If the empirical claims hold under the modeled costs, the work is significant for practical deployment of DLMs, which promise higher throughput than autoregressive models but face unique serving challenges. The explicit modeling of KV-cache-induced heterogeneity and the quality-aware optimizer are potential strengths if they are shown to generalize beyond the evaluated traces.

major comments (1)
  1. [experimental evaluation / modeling of KV caching] The central empirical claims (56.6 pp SLO gain, 46% latency reduction) are produced by solving the quality-aware optimization under the modeled step-level costs from approximate KV caching. The manuscript must include direct validation that these modeled per-step costs match measured execution times on real hardware under fluctuating load (e.g., in the experimental methodology or § on system implementation); without this, the optimizer may select configurations that fail to deliver the reported gains.
minor comments (1)
  1. [abstract] The abstract states quantitative gains but provides no details on experimental methodology, error bars, data exclusion rules, or statistical significance; this should be added to the abstract or a dedicated methods subsection.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will incorporate the requested validation in the revision.

Point-by-point responses
  1. Referee: [experimental evaluation / modeling of KV caching] The central empirical claims (56.6 pp SLO gain, 46% latency reduction) are produced by solving the quality-aware optimization under the modeled step-level costs from approximate KV caching. The manuscript must include direct validation that these modeled per-step costs match measured execution times on real hardware under fluctuating load (e.g., in the experimental methodology or § on system implementation); without this, the optimizer may select configurations that fail to deliver the reported gains.

    Authors: We agree that direct validation of the modeled per-step costs (derived from approximate KV caching) against measured execution times on real hardware under fluctuating load is necessary to confirm the optimizer produces realizable gains. The current manuscript models these costs from profiling in the system implementation section but does not include an explicit side-by-side comparison under dynamic load conditions. We will add this validation to the experimental methodology (new subsection) by reporting measured vs. modeled per-step latencies on the same hardware and traces used for the end-to-end results, thereby strengthening the link between the quality-aware optimizer and the reported SLO/latency improvements.

    Revision: yes
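
One simple form the promised measured-versus-modeled comparison could take is a mean absolute percentage error over per-step latencies. The sketch below is only an illustration; the function name and the sample timings are invented and do not come from the paper or the rebuttal.

```python
def mean_abs_pct_error(modeled_s, measured_s):
    """Mean absolute percentage error of modeled vs. measured step latencies."""
    if len(modeled_s) != len(measured_s) or not measured_s:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(m - r) / r for m, r in zip(modeled_s, measured_s)]
    return 100.0 * sum(errors) / len(errors)


# Invented per-step latencies in seconds; the second step stands in for a
# more expensive cache-refresh step.
modeled = [0.010, 0.050, 0.010, 0.010]
measured = [0.011, 0.048, 0.010, 0.012]
mape = mean_abs_pct_error(modeled, measured)
print(f"{mape:.2f}% mean absolute error")
```

Reporting this separately for refresh and cached steps, and at several load levels, would speak more directly to the referee's concern than a single aggregate number.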

Circularity Check

0 steps flagged

No circularity; empirical claims with no load-bearing derivations or self-referential fits

full rationale

The paper is a systems/empirical work whose central claims (SLO gains, latency reductions) are produced by experimental evaluation on benchmarks and traces. No equations, optimization formulations, or modeling steps are visible in the abstract or reader's summary that reduce by construction to fitted inputs or self-citations. The quality-aware optimization and step-level heterogeneity model are presented as design choices whose accuracy is evaluated externally rather than assumed by definition. This matches the reader's assessment of no visible derivations and warrants the default non-finding of score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified from the provided text.



Reference graph

Works this paper leans on

66 extracted references · 29 canonical work pages · 12 internal anchors

  1. [1]

    Deepak Agarwal, Bo Long, Jonathan Traupman, Doris Xin, and Liang Zhang. 2014. LASER: a scalable response prediction platform for online advertising. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (New York, New York, USA) (WSDM ’14). Association for Computing Machinery, New York, NY, USA, 173–182. doi:10.1145/2556195.2556252

  2. [2]

    Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, and Shiv Kumar Saini. 2024. Approximate caching for efficiently serving text-to-image diffusion models. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (Santa Clara, CA, USA) (NSDI’24). USENIX Association, USA, Article 65,...

  3. [3]

    Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitaraman, and Thomas Woo. 2024. Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (La Jolla, CA, USA) (ASPLOS ’24). Asso...

  4. [4]

    Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, and Hui Guan. 2025. DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling. In Eighth Conference on Machine Learning and Systems. https://openreview.net/forum?id=1N3ShLfcTf

  5. [5]

    Anthropic. 2025. Claude Code. https://claude.com/product/claude-code. Accessed: 2025-12-10

  6. [6]

    Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. Structured denoising diffusion models in discrete state-spaces. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS ’21). Curran Associates Inc., Red Hook, NY, USA, Article 1376, 13 pages

  7. [7]

    Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=GPKTIktA0k

  8. [8]

    Tzu-Tao Chang and Shivaram Venkataraman. 2025. Eva: Cost-Efficient Cloud-Based Cluster Scheduling. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys ’25). Association for Computing Machinery, New York, NY, USA, 1399–1416. doi:10.1145/3689031.3717483

  9. [9]

    Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. 2025. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation. arXiv:2510.06303 [cs.LG] https://arxiv.org/abs/2510.06303

  10. [10]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG] https://arxiv.org/abs/2110.14168

  11. [11]

    Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 613–627. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation...

  12. [12]

    Yinwei Dai, Rui Pan, Anand Iyer, Kai Li, and Ravi Netravali. 2024. Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Austin, TX, USA) (SOSP ’24). Association for Computing Machinery, New York, NY, USA, 607–623. doi:10.1145/3694715.3695963

  13. [13]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1189, 16 pages

  14. [14]

    Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232

  15. [15]

    Aditya Ganjam, Faisal Siddiqui, Jibin Zhan, Xi Liu, Ion Stoica, Junchen Jiang, Vyas Sekar, and Hui Zhang. 2015. C3: Internet-Scale Control Plane for Video Quality Optimization. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). USENIX Association, Oakland, CA, 131–144. https://www.usenix.org/conference/nsdi15/technical-sess...

  16. [16]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 2, 1 (2023)

  17. [17]

    Google DeepMind. 2025. Gemini Diffusion. https://deepmind.google/models/gemini-diffusion/. [text diffusion model]

  18. [18]

    Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 443–462. https://www.usenix.org/conference/osdi20/presentation/gujarati

  19. [19]

    Peizhen Guo, Bo Hu, and Wenjun Hu. 2022. Sommelier: Curating DNN Models for the Masses. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1876–1890. doi:10.1145/3514221.3526173

  20. [20]

    Gurobi Optimization, LLC. 2025. Gurobi Optimizer Reference Manual. https://www.gurobi.com

  21. [21]

    Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. 2023. DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Asso...

  22. [22]

    Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai, Xuefei Ning, Shengen Yan, Yun Liang, and Yu Wang. 2025. SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling. In Eighth Conference on Machine Learning and Systems. https://openreview.net/forum?id=ubIvpetAd6

  23. [23]

    Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan

  24. [24]

    DEEPSERVE: serverless large language model serving at scale. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA) (USENIX ATC ’25). USENIX Association, USA, Article 4, 16 pages

  25. [25]

    Leonard Kleinrock. 1975. Queueing Systems, Volume 1: Theory. Wiley-Interscience, USA

  26. [26]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  27. [27]

    Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 611–626. doi:10.1145/3600006.3613165

  28. [28]

    Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov. 2025. Mercury: Ultra-Fast Language Models Based on Diffusion. arXiv:2506.17298 [cs.CL] https://arxiv.org/abs/2506.17298

  29. [29]

    Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Dakai An, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, and Wei Wang. 2025. KATZ: efficient workflow serving for diffusion models with many adapters. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA...

  30. [30]

    Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025. A Survey on Diffusion Language Models. arXiv:2508.10875 [cs.CL] https://arxiv.org/abs/2508.10875

  31. [31]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, M...

  32. [32]

    Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models. In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), Yun-Nung Chen and Abhinav Rastogi (Eds.). Association for Computational Linguistics, Toronto, Canada, 47–58. doi:10.1865...

  33. [33]

    Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. 2025. dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching. arXiv:2506.06295 [cs.LG] https://arxiv.org/abs/2506.06295

  34. [34]

    Aaron Lou, Chenlin Meng, and Stefano Ermon. 2024. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML’24). JMLR.org, Article 1333, 30 pages

  35. [35]

    Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, and Vijay Chidambaram. 2022. Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 579–596. https://www.usenix.org/conference/osdi22/presentation/mohan

  36. [36]

    Jason Mohoney, Devesh Sarda, Mengze Tang, Shihabur Rahman Chowdhury, Anil Pacaci, Ihab F. Ilyas, Theodoros Rekatsinas, and Shivaram Venkataraman. 2025. Quake: adaptive indexing for vector search. In Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (Boston, MA, USA) (OSDI ’25). USENIX Association, USA, Article 9, 17 pages

  37. [37]

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 5...

  38. [38]

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. arXiv preprint arXiv:2502.09992 (2025)

  39. [39]

    Augustus Odena, Charles Sutton, David Martin Dohan, Ellen Jiang, Henryk Michalewski, Jacob Austin, Maarten Paul Bosma, Maxwell Nye, Michael Terry, and Quoc V. Le. 2021. Program Synthesis with Large Language Models. In n/a. n/a, n/a. n/a

  40. [40]

    OpenAI. 2024. GPT-4 Technical Report. arXiv (2024). https://arxiv.org/abs/2303.08774

  41. [41]

    Anand Padmanabha Iyer, Mingyu Guan, Yinwei Dai, Rui Pan, Swapnil Gandhi, and Ravi Netravali. 2024. Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Austin, TX, USA) (SOSP ’24). Association for Computing Machinery, New York, NY, USA, 624–639. ...

  42. [42]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 118–132. doi:10.1109/ISCA59077.2024.00019

  43. [43]

    Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 397–411. https://www.usenix.org/conference/atc21/presentation/romero

  44. [44]

    Emma Roth. 2025. OpenAI says ChatGPT users send over 2.5 billion prompts every day. The Verge (2025). https://www.theverge.com/news/710867/openai-chatgpt-daily-prompts-2-billion Accessed: 2025-12-10

  45. [45]

    Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Edgar Mariano Marroquin, Alexander M Rush, Yair Schiff, Justin T Chiu, and Volodymyr Kuleshov. 2024. Simple and Effective Masked Diffusion Language Models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=L4uaAR4ArM

  46. [46]

    Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: a GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP ’19). Association for Computing Machinery, N...

  47. [47]

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. 2025. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference. arXiv...

  48. [48]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). 1348–1362. doi:10.1109/HPCA61900.2025.00102

  49. [49]

    Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, and Ana Klimovic. 2024. DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian W...

  50. [50]

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: dynamic scheduling for large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI’24). USENIX Association, USA, Article 10, 19 pages

  51. [51]

    ShareGPT Team. [n. d.]. ShareGPT. https://sharegpt.com/

  52. [52]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] https://arxiv.org/abs/2302.13971

  53. [53]

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems (KDD ’25). Association for Computing Machinery, New York, NY, USA, 5831–5841. doi:10.1145/3711896.3737413

  54. [54]

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. In Proceedings of the 38th International Conference...

  55. [55]

    Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dLLM v2: Efficient Block-Diffusion LLM. arXiv:2509.26328 [cs.CL] https://arxiv.org/abs/2509.26328

  56. [56]

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding. arXiv:2505.22618 [cs.CL] https://arxiv.org/abs/2505.22618

  57. [57]

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. 2025. Dream 7B: Diffusion Large Language Models. arXiv preprint arXiv:2508.15487 (2025)

  58. [58]

    Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica

  59. [59]

    SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong

  60. [60]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA...

  61. [61]

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. 2025. DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Ros...

  62. [62]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI’24). USENIX Association, USA, Ar...

  63. [63]

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. arXiv:2505.19223 [cs.LG] https://arxiv.org/abs/2505.19223

  64. [64]

    Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, and Ji-Rong Wen. 2025. LLaDA-MoE: A Sparse MoE Di...

  65. [65]

    Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. 2025. NanoFlow: towards optimal large language model serving throughput. In Proceedings of the 19th USENIX Conference on Operating Systems Design...
