pith. machine review for the scientific record.

arXiv: 2604.28175 · v1 · submitted 2026-04-30 · 💻 cs.LG

Recognition: unknown

Strait: Perceiving Priority and Interference in ML Inference Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 07:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords ML inference serving · GPU scheduling · priority-aware scheduling · interference prediction · deadline satisfaction · concurrent execution · DNN models · data transfer contention

The pith

Strait reduces deadline violations for high-priority ML inference tasks by 1 to 11 percentage points through interference prediction on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Strait, a serving system for machine learning inference requests that must share limited GPU resources. It builds an adaptive model to forecast the extra delays that arise when concurrent requests contend for transfer bandwidth and interfere during kernel execution. The scheduler then uses those forecasts to give high-priority requests the resources they need to finish on time. A reader would care because real deployments often mix urgent and routine queries on the same hardware, and missing deadlines on the urgent ones breaks service guarantees. Tests under heavy load show the approach cuts missed deadlines for priority work while keeping the side effects on ordinary work modest, and it yields fairer outcomes than software-defined preemption methods.

Core claim

Strait enhances deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, it models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks and exhibits more equitable performance than software-defined preemption approaches.

What carries the argument

An adaptive prediction model that estimates data-transfer contention and kernel-execution interference to drive priority-aware scheduling decisions for dual-priority ML inference workloads.
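
A minimal sketch (not the authors' implementation) of how such predictions can drive scheduling: predicted transfer contention and kernel interference inflate each request's latency estimate, and low-priority work is admitted only when the inflated estimates keep in-flight high-priority requests inside their deadlines. The linear interference factor and all field names here are illustrative assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline_ms: float                          # absolute deadline; heap key (EDF order)
    priority: int = field(compare=False)        # 0 = high priority (HP), 1 = low (LP)
    transfer_ms: float = field(compare=False)   # isolated host-to-device transfer time
    kernel_ms: float = field(compare=False)     # isolated kernel execution time

def predicted_latency_ms(req, concurrent):
    """Inflate isolated latency with predicted contention and interference.

    Illustrative model: transfers queue FIFO behind in-flight copies
    (transfer contention), and kernel time stretches with the number of
    co-running requests (execution interference)."""
    queued_transfer = sum(r.transfer_ms for r in concurrent)  # PCIe queueing
    slowdown = 1.0 + 0.35 * len(concurrent)                   # assumed interference factor
    return queued_transfer + req.transfer_ms + req.kernel_ms * slowdown

def schedule(now_ms, high_q, low_q, in_flight):
    """Dispatch HP work first; admit LP work only if the prediction says the
    added interference keeps every in-flight HP request within its deadline."""
    if high_q:
        return heapq.heappop(high_q)
    if low_q:
        lp = low_q[0]
        hp_safe = all(
            now_ms + predicted_latency_ms(hp, [r for r in in_flight if r is not hp] + [lp])
            <= hp.deadline_ms
            for hp in in_flight if hp.priority == 0
        )
        if hp_safe:
            return heapq.heappop(low_q)
    return None  # hold LP work back rather than risk an HP violation

# Example: one in-flight HP request; decide whether to admit queued LP work.
hp = Request(deadline_ms=30.0, priority=0, transfer_ms=2.0, kernel_ms=6.0)
lp = Request(deadline_ms=200.0, priority=1, transfer_ms=4.0, kernel_ms=9.0)
assert schedule(now_ms=10.0, high_q=[], low_q=[lp], in_flight=[hp]) is lp
```

The point of the sketch is the shape of the decision, not the model: Strait's actual predictor is adaptive rather than a fixed linear factor.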

If this is right

  • High-priority inference tasks meet deadlines more reliably under intense GPU utilization.
  • Low-priority tasks experience only modest increases in latency or violations.
  • The system achieves more balanced outcomes across priority classes than preemption-based alternatives.
  • Latency estimates become more reliable when multiple DNN models execute concurrently on the same GPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contention-modeling idea could be generalized to handle more than two priority levels if the prediction accuracy holds.
  • Production platforms might use this technique to run mixed-priority workloads on fewer GPUs without dedicated hardware partitions.
  • Extending the model to include network or CPU interference could improve scheduling in heterogeneous inference clusters.

Load-bearing premise

The adaptive prediction model for data-transfer contention and kernel-execution interference remains accurate enough under real production dual-priority workloads to support effective priority-aware scheduling decisions.

What would settle it

A production-style dual-priority workload run where disabling the interference prediction model produces no reduction (or an increase) in high-priority deadline violations compared to the full Strait scheduler.
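
A minimal sketch of that settling test, assuming per-request completion times and deadlines are logged; the toy traces below are hypothetical, and the ablated run stands for Strait with the interference predictor disabled but everything else intact.

```python
def violation_rate(log, priority="HP"):
    """Fraction of requests at a given priority finishing after their deadline.
    `log` holds (priority, completion_ms, deadline_ms) tuples."""
    misses = [completion > deadline
              for p, completion, deadline in log if p == priority]
    return sum(misses) / len(misses) if misses else 0.0

# Hypothetical traces from the same workload and seed.
strait_log  = [("HP", 18.0, 20.0), ("HP", 19.5, 20.0), ("LP", 40.0, 35.0)]
ablated_log = [("HP", 22.0, 20.0), ("HP", 19.0, 20.0), ("LP", 38.0, 35.0)]

delta_pp = 100 * (violation_rate(ablated_log) - violation_rate(strait_log))
print(f"HP violation reduction attributable to the predictor: {delta_pp:.2f} pp")
# A delta_pp at or below zero on a production-style run would undercut the claim.
```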

Figures

Figures reproduced from arXiv: 2604.28175 by Haidong Zhao, Nikolaos Georgantas.

Figure 1: In each scenario, ResNet-50 (HP task) is co-located with a distinct model (LP task). […] with a batch-4 RoBERTa-B. Compared with isolated execution, the total kernel execution latency slowdown exceeds 3.4×, the total inter-kernel intervals exceed 6.6×, and consequently the overall slowdown exceeds 3.6×. In contrast, when this batch-1 ResNet-50 is co-located with a batch-2 YOLO-v8n, these values drop to 1.7×,… view at source ↗

Figure 4: The figure visualizes contention during data transfer (green blocks) and kernel execution (blue blocks). When using pinned memory, concurrent batch submission results in FIFO-ordered data transfers. […] delays. A global prediction model is used to serve all GPUs within the node to estimate kernel execution interference. This model continuously adapts to dynamic workloads to sustain prediction accuracy (Sectio… view at source ↗

Figure 5: Deadline violation rates and latency distributions for different scheduling policies under a 4-GPU node. (CDF of latency in ms, log scale; panels: (a) pressure on HP tasks, (b) pressure on LP tasks; series: XSched (HP), XSched (LP), Strait (HP), Strait (LP).) view at source ↗

Figure 6: CDF of inference latency for XSched and Strait. (From §5.3, Comparison with Kernel-Level Scheduling:) We use XSched [80] to support fixed-priority scheduling [60] for Triton inference server [11]. We adopt XSched’s original configuration, where batching is not employed and kernels are directly submitted to its abstract queues. For comparison, we enable batching in Strait but eliminate the batch formation timeout. B… view at source ↗

Figure 7: Evaluation results for inaccuracy, adaptability, and sustainability. […] pressure; however, this approach may be limited if a batch’s resource demands vary significantly during execution. Fortunately, the overall latency prediction can mitigate these prediction errors, and this value is ultimately used for scheduling. This is because kernel execution latency is only one element of the overall latency, and we… view at source ↗

Figure 10: Impact of sequentially removing a task prioritization mechanism. […] sensitivity to profiling drift. view at source ↗

Figure 9: Profiling drift in resource throughput relative to the baseline without drift. (From Appendix A.1, Adaptive Throttling:) We employ an AIMD policy to throttle the aggregate resource throughput of concurrent LP tasks on the GPU (Section 3.3). We select the control parameters to prevent LP tasks from oversubscribing GPU resources without causing severe underutilization. view at source ↗
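
The Figure 9 excerpt mentions an AIMD policy for throttling the aggregate throughput of concurrent LP tasks (the paper's Section 3.3). A minimal sketch of such a controller; the step sizes and floor are placeholder values, not the paper's parameters.

```python
class AIMDThrottle:
    """Additive-increase/multiplicative-decrease cap on the normalized
    resource throughput granted to concurrent low-priority tasks."""

    def __init__(self, cap=1.0, add_step=0.05, mult_factor=0.5, floor=0.1):
        self.cap = cap                  # 1.0 = LP tasks unthrottled
        self.add_step = add_step        # additive probe per pressure-free interval
        self.mult_factor = mult_factor  # multiplicative back-off under HP pressure
        self.floor = floor              # never starve LP tasks entirely

    def update(self, hp_deadline_pressure: bool) -> float:
        if hp_deadline_pressure:
            # HP deadlines at risk: cut the LP throughput cap sharply.
            self.cap = max(self.floor, self.cap * self.mult_factor)
        else:
            # No pressure observed: probe upward gently to limit underutilization.
            self.cap = min(1.0, self.cap + self.add_step)
        return self.cap
```

This matches the trade-off the excerpt names: the multiplicative cut prevents LP tasks from oversubscribing the GPU, while the small additive probe avoids severe underutilization.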
Original abstract

Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present \emph{Strait}, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks. Compared to software-defined preemption approaches, Strait also exhibits more equitable performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Strait, an ML inference serving system for GPUs that employs an adaptive prediction model to estimate data-transfer contention and kernel-execution interference. These estimates feed into priority-aware scheduling for dual-priority workloads. Under intense workloads, Strait is claimed to reduce high-priority deadline violations by 1.02–11.18 percentage points relative to baselines, while imposing acceptable costs on low-priority tasks and achieving more equitable performance than software-defined preemption.

Significance. If the adaptive prediction model proves accurate and the reported gains are causally attributable to it (rather than to basic priority queuing or workload artifacts), Strait would offer a practical contribution to on-premises inference serving by improving deadline satisfaction for prioritized traffic at high GPU utilization. The empirical focus on real dual-priority workloads is a strength, but the absence of isolated model validation and detailed experimental parameters substantially weakens the immediate significance and reproducibility of the results.

major comments (3)
  1. [Abstract] The headline quantitative claim (1.02–11.18 pp reduction in high-priority deadline violations) is presented without workload parameters (model types, request rates, GPU utilization), baseline details, number of trials, or error bars. This prevents evaluation of whether the range reflects consistent gains or sensitivity to specific conditions.
  2. [Evaluation] No prediction-error metrics (e.g., latency MAE or accuracy under concurrent dual-priority execution) are reported for the adaptive contention/interference model, and no ablation disables the estimator while retaining the rest of the scheduler. Consequently, the causal link between the model’s predictions and the observed violation reductions cannot be verified; gains could arise from other scheduling mechanisms.
  3. [System Design] The adaptive prediction model’s update triggers, input features, and handling of interference under varying priority mixes are described at too high a level to assess robustness or reproducibility. Without these internals, it is impossible to determine whether the model remains sufficiently accurate under the high-utilization workloads used in the evaluation.
minor comments (3)
  1. [Abstract] The abstract states 'acceptable costs on low-priority tasks' and 'more equitable performance' without defining the metrics (e.g., latency increase, throughput loss) or providing the corresponding quantitative values.
  2. [Related Work] Related-work discussion should explicitly compare the adaptive model to prior GPU interference predictors (e.g., those based on kernel profiling or ML-based contention estimation) to clarify novelty.
  3. [Figures/Tables] Figure and table captions would benefit from explicit statements of the number of experimental runs and the precise definition of 'deadline violation' used in the plots.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We agree that several clarifications and additions will strengthen the paper's reproducibility and help establish the contribution of the adaptive prediction model. We outline the specific revisions we plan to incorporate in the next version.

Point-by-point responses
  1. Referee: [Abstract] The headline quantitative claim (1.02–11.18 pp reduction in high-priority deadline violations) is presented without workload parameters (model types, request rates, GPU utilization), baseline details, number of trials, or error bars. This prevents evaluation of whether the range reflects consistent gains or sensitivity to specific conditions.

    Authors: We acknowledge that the abstract, due to its brevity, does not include all experimental parameters. The reported range reflects results across multiple dual-priority workloads (including ResNet-50, BERT, and VGG models at request rates that drive 80–95% GPU utilization) with FIFO and preemption baselines, averaged over 5 runs per configuration (error bars appear in the corresponding figures). In the revised manuscript we will expand the abstract by one sentence to list representative parameters and explicitly state that full workload details, baselines, and trial counts are provided in Section 5 and Table 1. revision: yes

  2. Referee: [Evaluation] No prediction-error metrics (e.g., latency MAE or accuracy under concurrent dual-priority execution) are reported for the adaptive contention/interference model, and no ablation disables the estimator while retaining the rest of the scheduler. Consequently, the causal link between the model’s predictions and the observed violation reductions cannot be verified; gains could arise from other scheduling mechanisms.

    Authors: We agree that isolating the estimator’s contribution is important for establishing causality. The current evaluation focuses on end-to-end system performance, but we will add (1) latency prediction MAE and accuracy figures under concurrent dual-priority execution and (2) an ablation that compares the full Strait scheduler against a priority-aware baseline that disables the adaptive estimator (relying only on static estimates and basic queuing). These additions will be placed in a new subsection of the evaluation and will directly address whether the observed 1.02–11.18 pp reductions are attributable to the contention and interference modeling. revision: yes

  3. Referee: [System Design] The adaptive prediction model’s update triggers, input features, and handling of interference under varying priority mixes are described at too high a level to assess robustness or reproducibility. Without these internals, it is impossible to determine whether the model remains sufficiently accurate under the high-utilization workloads used in the evaluation.

    Authors: We accept that the system-design description is currently high-level. The adaptive model performs online updates when observed latency deviates beyond a configurable threshold (currently 15%), using input features that include instantaneous GPU utilization, per-request data-transfer size, kernel launch parameters, and the current high/low-priority request ratio. Interference is modeled via two separate lightweight regressors (one for PCIe contention, one for SM/kernel interference) that are retrained on recent observations. In the revised version we will add pseudocode for the update loop, an explicit list of input features with their normalization, and a paragraph describing behavior under different priority mixes (e.g., 70/30 vs. 90/10). These details will allow readers to assess robustness at the high-utilization regimes reported in the evaluation. revision: yes
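
Taking the (simulated) rebuttal's description at face value, the update loop reduces to two small online regressors retrained when relative error crosses the 15% threshold. A sketch under those assumptions; the SGD linear model is an illustrative stand-in for whatever "lightweight regressor" the authors actually use.

```python
import numpy as np

class OnlinePredictor:
    """Lightweight linear regressor updated online; one instance would model
    PCIe transfer contention, another SM/kernel interference."""

    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def sgd_step(self, x, y):
        err = self.predict(x) - y
        self.w -= self.lr * err * x  # squared-error gradient step

DEVIATION_THRESHOLD = 0.15  # the 15% retrain trigger quoted above

def observe(models_and_samples):
    """Retrain a regressor only when its relative prediction error exceeds
    the threshold. Each element is (model, feature_vector, observed_ms)."""
    for model, x, y in models_and_samples:
        if y > 0 and abs(model.predict(x) - y) / y > DEVIATION_THRESHOLD:
            model.sgd_step(x, y)

# Features per the rebuttal (normalized, order illustrative): GPU utilization,
# per-request transfer size, kernel launch parameters, HP/LP request ratio.
pcie, sm = OnlinePredictor(4), OnlinePredictor(4)
x = np.array([0.9, 0.4, 0.5, 0.7])
observe([(pcie, x, 3.2), (sm, x, 11.0)])
```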

Circularity Check

0 steps flagged

No circularity in empirical system design and evaluation

Full rationale

The paper presents Strait as an empirical ML inference serving system that incorporates an adaptive prediction model for data-transfer contention and kernel-execution interference to enable priority-aware scheduling. No mathematical derivation chain, equations, or first-principles results are described. The central claims rest on end-to-end experimental results (deadline violation reductions under workloads) rather than any fitted parameter renamed as a prediction or self-referential definition. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description to support load-bearing steps. The contribution is self-contained as a systems engineering and evaluation effort without reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit mathematical axioms, free parameters, or invented physical entities appear in the abstract. The adaptive prediction model may implicitly contain fitted parameters, but none are named or quantified here.

pith-pipeline@v0.9.0 · 5435 in / 1202 out tokens · 100686 ms · 2026-05-07T07:43:42.265174+00:00 · methodology


Reference graph

Works this paper leans on

115 extracted references · 62 canonical work pages · 4 internal anchors

  1. 2020. Terminology used in Nsight Compute. https://stackoverflow.com/questions/63403203/terminology-used-in-nsight-compute?rq=1
  2. 2026. Apache Hadoop YARN. https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
  3. 2026. CUDA C++ Programming Guide: v13.1. https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
  4. 2026. Kubernetes. https://kubernetes.io/
  5. 2026. MULTI-PROCESS SERVICE: vR590. https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
  6. 2026. nuScenes. https://www.nuscenes.org/nuscenes#data-collection
  7. 2026. NVIDIA ADA LOVELACE PROFESSIONAL GPU ARCHITECTURE. https://images.nvidia.com/aem-dam/en-zz/Solutions/technologies/NVIDIA-ADA-GPU-PROVIZ-Architecture-Whitepaper_1.1.pdf
  8. 2026. NVIDIA Dynamo Platform. https://developer.nvidia.com/dynamo
  9. 2026. NVIDIA Nsight Compute. https://developer.nvidia.com/nsight-compute
  10. 2026. NVIDIA Nsight Systems. https://developer.nvidia.com/nsight-systems
  11. 2026. NVIDIA Triton Inference Server. https://developer.nvidia.com/triton-inference-server
  12. 2026. ONNX Runtime. https://onnxruntime.ai/
  13. 2026. Tensorflow Serving shared batch scheduler. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/batching_util/shared_batch_scheduler.h
  14. 2026. TensorRT Documentation. https://docs.nvidia.com/deeplearning/tensorrt/
  15. 2026. TorchServe. https://pytorch.org/serve/
  16–17. Vivek Adarsh, Michael Nekrasov, Udit Paul, Tarun Mangla, Arpit Gupta, Morgan Vigil-Hayes, Ellen Zegura, and Elizabeth Belding. 2021. Coverage is Not Binary: Quantifying Mobile Broadband Quality in Urban, Rural, and Tribal Contexts. In 2021 International Conference on Computer Communications and Networks (ICCCN). 1–9. doi:10.1109/ICCCN52240.2021.9522152
  18. Evidently AI. 2025. What is Concept Drift in ML, and How to Detect and Address It. https://www.evidentlyai.com/ml-in-production/concept-drift
  19. Tanya Amert, Nathan Otterness, Ming Yang, James H. Anderson, and F. Donelson Smith. 2017. GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed. In 2017 IEEE Real-Time Systems Symposium (RTSS). 104–115. doi:10.1109/RTSS.2017.00017
  20. Romil Bhardwaj, Kirthevasan Kandasamy, Asim Biswal, Wenshuo Guo, Benjamin Hindman, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2023. Cilantro: Performance-Aware Resource Allocation for General Objectives via Online Feedback. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 623–643. https://www.usenix.org/conference/os...
  21. Sumon Kumar Bose, Bapi Kar, Mohendra Roy, Pradeep Kumar Gopalakrishnan, and Arindam Basu. 2019. ADEPOS: Anomaly detection based power saving for predictive maintenance using edge computing. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASPDAC ’19). Association for Computing Machinery, 597–602. doi:10.1145/3287624.3287716
  22. Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. 2014. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 285–300. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/boutin
  23. Jin Cao, William S. Cleveland, Dong Lin, and Don X. Sun. 2001. On the nonstationarity of Internet traffic. In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’01). 102–112. doi:10.1145/378420.378440
  24. Bohsun Chen. 2024. Understanding Huber Loss function: Insights from Applications. https://medium.com/@devcharlie2698619/understanding-huber-loss-function-insights-from-applications-5c1c5145d2c4
  25. Quan Chen, Hailong Yang, Minyi Guo, Ram Srivatsa Kannan, Jason Mars, and Lingjia Tang. 2017. Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17). ...
  26. Quan Chen, Hailong Yang, Jason Mars, and Lingjia Tang. 2016. Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’16). Association for Computing Machinery,...
  27. Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. 2022. Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 199–216. https://www.usenix.org/conference/atc22/presentation/choi-seungbeom
  28–29. Brad Cline, Radu Stefan Niculescu, Duane Huffman, and Bob Deckel. 2017. Predictive maintenance applications for machine learning. In 2017 Annual Reliability and Maintainability Symposium (RAMS). 1–7. doi:10.1109/RAM.2017.7889679
  30. Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, and Dimitrios Skarlatos. 2025. LithOS: An Operating System for Efficient Machine Learning on GPUs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25). 1–17. doi:10.1145/3731569.3764818
  31. Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines. In Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC ’20). 477–491. doi:10.1145/3419111.3421285
  32. Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 613–627. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw
  33. William J. Dally, Stephen W. Keckler, and David B. Kirk. 2021. Evolution of the Graphics Processing Unit (GPU). IEEE Micro 41, 6 (2021), 42–51. doi:10.1109/MM.2021.3113475
  34. Priyanka Das. 2024. Real-Time IoT-Based Predictive Maintenance System for Automotive Assembly Lines. Fuel Cells Bulletin (02 2024). doi:10.52710/fcb.224
  35. Narjes Davari, Bruno Veloso, Rita P. Ribeiro, Pedro Mota Pereira, and João Gama. 2021. Predictive maintenance based on anomaly detection using deep learning for air production unit in the railway industry. In 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA). 1–10. doi:10.1109/DSAA53316.2021.9564181
  36. Aditya Dhakal, Sameer G Kulkarni, and K. K. Ramakrishnan. 2020. GSLICE: controlled spatial sharing of GPUs for a scalable inference platform. In Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC ’20). 492–506. doi:10.1145/3419111.3421284
  37. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929
  38. John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12, 61 (2011), 2121–2159. http://jmlr.org/papers/v12/duchi11a.html
  39–40. Paul Elvinger, Foteini Strati, Natalie Enright Jerger, and Ana Klimovic. 2025. Understanding GPU Resource Interference One Level Deeper. In Proceedings of the 2025 ACM Symposium on Cloud Computing (SoCC ’25). 687–694. doi:10.1145/3772052.3772270
  41. GigaSpaces. 2023. Amazon Found Every 100ms of Latency Cost them 1% in Sales. https://www.gigaspaces.com/blog/amazon-found-every-100ms-of-latency-cost-them-1-in-sales
  42. Guin Gilman, Samuel S. Ogden, Tian Guo, and Robert J. Walls. 2021. Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels. SIGMETRICS Perform. Eval. Rev. 48, 3 (March 2021), 81–88. doi:10.1145/3453953.3453972
  43. Guin Gilman and Robert J. Walls. 2021. Characterizing concurrency mechanisms for NVIDIA GPUs under deep learning workloads. Performance Evaluation 151 (2021), 102234. doi:10.1016/j.peva.2021.102234
  44. Roger Grosse. 2017. Lecture 8: Optimization. https://www.cs.toronto.edu/~cmaddis/courses/sta314_f25/rgrosse_optimization_notes.pdf
  45. Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443–462. https://www.usenix.org/conference/osdi20/presentation/gujarati
  46. Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539–558. https://www.usenix.org/conference/osdi22/presentation/han
  47. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. doi:10.1109/CVPR.2016.90
  48. Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. 2012. Neural Networks for Machine Learning. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
  49. Yitao Hu, Rajrup Ghosh, and Ramesh Govindan. 2021. Scrooge: A Cost-Effective Deep Learning Inference System. In Proceedings of the ACM Symposium on Cloud Computing (SoCC ’21). 624–638. doi:10.1145/3472883.3486993
  50. Szu-Hao Huang and Ying-Cheng Pan. 2015. Automated visual inspection in the semiconductor industry: A survey. Computers in Industry 66 (2015), 1–10. doi:10.1016/j.compind.2014.10.006
  51. Nebbiolo Technologies Inc. 2020. Audi’s Automated Factory Moves Closer to Industry 4.0 with Intel’s Edge Machine Learning and Nebbiolo Technologies’ Intelligent Edge Computing Software Platform. https://www.prweb.com/releases/audi-s-automated-factory-moves-closer-to-industry-4-0-with-intel-s-edge-machine-learning-and-nebbiolo-technologies-intelligent-edg...
  52. Rakshith Jayanth, Neelesh Gupta, and Viktor Prasanna. 2024. Benchmarking Edge AI Platforms for High-Performance ML Inference. https://arxiv.org/abs/2409.14803
  53. Beomyeol Jeon, Chen Wang, Diana Arroyo, Alaa Youssef, and Indranil Gupta. 2025. A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys ’25). Association for Computing Machinery, 524–540. doi:10.1145/3689031.3696071
  54. Yizhou Jin, Yu Lu, Gang Zhou, Qingjie Liu, and Yunhong Wang. 2023. Glass Wool Defect Detection Using an Improved YOLOv5. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 4385–4394. doi:10.1109/CVPRW59228.2023.00461
  55. Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023. Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics
  56. Leela S. Karumbunathan. July 2022. NVIDIA Jetson AGX Orin Series: Technical Brief. https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf
  57. Sejin Kim and Yoonhee Kim. 2021. Interference-aware execution framework with Co-scheML on GPU clusters. Cluster Computing 26, 5 (May 2021), 2577–2589. doi:10.1007/s10586-021-03299-z
  58. Yeonjae Kim, Igjae Kim, Kwanghoon Choi, Jeongseob Ahn, Jongse Park, and Jaehyuk Huh. 2024. Interference-Aware DNN Serving on Heterogeneous Processors in Edge Systems. In 2024 IEEE 42nd International Conference on Computer Design (ICCD). 199–206. doi:10.1109/ICCD63220.2024.00038
  59. Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
  60–61. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). Association for Computing Machinery, 611–626. doi:10.1145/3600006.3613165
  62. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1, 4 (1989), 541–551. doi:10.1162/neco.1989.1.4.541
  63. Seonho Lee, Amar Phanishayee, and Divya Mahajan. 2025. Forecasting GPU Performance for Deep Learning Training and Inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS ’25). Association for Computing Machinery, New York, NY, USA, 493–508. doi:10.1145/...
  64. C. L. Liu and James W. Layland. 1973. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM (JACM) 20, 1 (Jan. 1973), 46–61. doi:10.1145/321738.321743
  65. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
  66. Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A ConvNet for the 2020s. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11966–11976. doi:10.1109/CVPR52688.2022.01167
  67. Daniel Mendoza, Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. Interference-Aware Scheduling for Inference Serving. In Proceedings of the 1st Workshop on Machine Learning and Systems (EuroMLSys ’21). Association for Computing Machinery, 80–88. doi:10.1145/3437984.3458837
  68. Victor Millnert and Johan Eker. 2020. HoloScale: horizontal and vertical scaling of cloud resources. In 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC). 196–205. doi:10.1109/UCC48980.2020.00038
  69. Kelvin K. W. Ng, Henri Maxime Demoulin, and Vincent Liu. 2023. Paella: Low-latency Model Serving with Software-defined GPU Scheduling. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). Association for Computing Machinery, 595–610. doi:10.1145/3600006.3613163
  70. Shadi A. Noghabi, Landon Cox, Sharad Agarwal, and Ganesh Ananthanarayanan. 2020. The Emerging Landscape of Edge Computing. GetMobile: Mobile Comp. and Comm. 23, 4 (May 2020), 11–20. doi:10.1145/3400713.3400717
  71. Christopher Olston, Fangwei Li, Jeremiah Harmsen, Jordan Soyke, Kiril Gorovoy, Li Lao, Noah Fiedel, Sukriti Ramesh, and Vinu Rajashekhar. 2017. TensorFlow-Serving: Flexible, High-Performance ML Serving. In Workshop on ML Systems at NIPS 2017.
  72. Nathan Otterness and James H. Anderson. 2020. AMD GPUs as an Alternative to NVIDIA for Supporting Real-Time Workloads. In 32nd Euromicro Conference on Real-Time Systems (ECRTS 2020), Marcus Völp (Ed.), Vol. 165. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 10:1–10:23. doi:10.4230/LIPIcs.ECRTS.2020.10
  73. Arthi Padmanabhan, Neil Agarwal, Anand Iyer, Ganesh Ananthanarayanan, Yuanchao Shu, Nikolaos Karianakis, Guoqing Harry Xu, and Ravi Netravali. 2023. Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 973–994. https://www.usenix.org/conferen...
  74. Ning Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12, 1 (1999), 145–151. doi:10.1016/S0893-6080(98)00116-6
  75. Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2019. On the Convergence of Adam and Beyond. https://arxiv.org/abs/1904.09237
  76–77. Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kan... MLPerf inference benchmark. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA ’20). IEEE Press, 446–459. doi:10.1109/ISCA45697.2020.00045
  78. Deloitte Research. 2020. Milliseconds Make Millions. https://www.deloitte.com/ie/en/services/consulting/research/milliseconds-make-millions.html
  79. Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 397–411. https://www.usenix.org/conference/atc21/presentation/romero
  80. Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. https://www.ruder.io/optimizing-gradient-descent

Showing first 80 references.