pith. machine review for the scientific record.

arxiv: 2605.01060 · v1 · submitted 2026-05-01 · 💻 cs.DC · cs.LG

Recognition: unknown

SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:11 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords streaming GPU encoding · embeddings · partitioned data · memory-bounded processing · cost model · fault tolerance · heterogeneous batches

The pith

SURGE streams GPU encoding over heterogeneous partitions at fixed-batch throughput with O(B_min + n_max) memory instead of O(N).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SURGE resolves the conflict between logical data partitioning and GPU efficiency when generating embeddings at scale. It supplies a cost model that predicts throughput to within 2 percent and a memory-safety bound that underpins a streaming two-threshold policy, so partitions are processed without ever holding all data in memory at once. The result is throughput matching fixed batching, far lower peak memory, much faster first output, and recovery from interruptions. Tests on 10 million texts across multiple encoders and partition-size distributions confirm the gains hold.
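The two-threshold policy itself is not reproduced in this summary, but a minimal sketch consistent with the stated O(B_min + n_max) bound might look like the following (the function name and signature are illustrative, not the authors' API, and the B_max cap from Figure 8 is omitted for brevity):

```python
from typing import Iterable, Iterator, List, Tuple

def superbatch_stream(
    partitions: Iterable[Tuple[str, List[str]]],
    b_min: int = 100_000,
) -> Iterator[List[Tuple[str, str]]]:
    """Accumulate whole partitions until the buffer reaches b_min texts,
    then flush one SuperBatch for GPU encoding.

    Just before any append the buffer holds fewer than b_min texts, and
    the append that triggers a flush adds at most n_max texts (the
    largest partition), so peak buffered data stays below b_min + n_max
    regardless of the total dataset size N.
    """
    buffer: List[Tuple[str, str]] = []
    for part_id, texts in partitions:
        buffer.extend((part_id, t) for t in texts)  # keep partition labels
        if len(buffer) >= b_min:
            yield buffer   # encode, serialize, upload; then memory resets
            buffer = []
    if buffer:
        yield buffer       # final partial SuperBatch
```

Because the first SuperBatch is ready after roughly B_min texts rather than all N, time-to-first-output is effectively constant in N, which is the 68× TTFO claim in streaming form.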

Core claim

The central claim: a streaming two-threshold policy, grounded in a cost model (Theorem 1) that predicts throughput within 2 percent and a memory-safety bound (Lemma 3) that limits peak memory to O(B_min + n_max), lets GPU encoding of heterogeneous partitioned data match fixed-batch throughput while using 12.6 times less memory, delivering 68 times faster time-to-first-output, and enabling crash recovery at SuperBatch granularity.
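A toy version of such a cost model, with entirely illustrative parameter names rather than Theorem 1's notation, shows why call granularity dominates for compute-light encoders:

```python
def predicted_throughput(n_texts: int, n_calls: int,
                         t_overhead: float, t_per_text: float) -> float:
    """Toy cost model: each GPU call pays a fixed overhead t_overhead
    (IPC, launch, transfer setup) plus t_per_text seconds of compute per
    text.  Fewer, larger calls amortize the fixed term."""
    total_time = n_calls * t_overhead + n_texts * t_per_text
    return n_texts / total_time
```

Under this shape, per-partition processing (n_calls = P, one call per partition) loses badly when t_overhead is large relative to t_per_text, which is the abstract's point about compute-light models; SuperBatching drives n_calls down to roughly N / B_min. Theorem 1 is presumably a refinement of this form, validated to within 2%.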

What carries the argument

The streaming two-threshold policy with SuperBatch granularity, enabled by the memory-safety bound and cost model for unified resource-efficient encoding across partitions.

Load-bearing premise

The cost model and memory-safety bound accurately predict throughput and enable safe streaming for the heterogeneous partitions and encoders in the workload.

What would settle it

Run the system on a 100 million text dataset with the same log-normal partition sizes and measure whether peak memory remains near 3 GB while throughput stays within 2 percent of the fixed-batch baseline.
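A hedged sketch of how that check could be instrumented (Linux-specific ru_maxrss units; encode_stream is a stand-in for either SURGE or the fixed-batch baseline, not an API from the paper):

```python
import resource
import time

def measure_run(encode_stream, texts):
    """Consume an encoding pipeline end to end; report throughput and
    peak host memory for comparison against the paper's numbers."""
    start = time.monotonic()
    n_encoded = 0
    for batch in encode_stream(texts):
        n_encoded += len(batch)
    elapsed = time.monotonic() - start
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    return n_encoded / elapsed, peak_kib / 2**20  # texts/s, peak GiB
```

Running both pipelines through the same harness at 100 million texts would show whether SURGE's peak stays near 3 GB while the throughput ratio stays within 2 percent.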

Figures

Figures reproduced from arXiv: 2605.01060 by Ajay Kumar, Deep Narayan Mishra, Rishi Bhatia, Shashank Kapadia, Sujal Reddy Alugubelli, Swapnil Yadav.

Figure 1. SURGE's key advantage: bounded O(B_min + n_max) memory and O(1) time-to-first-output (TTFO) vs. fixed-batch's O(N) scaling. At 50M texts, SURGE uses 18.5× less memory and produces output 337× faster. SURGE's TTFO decreases from 4.5 s at 1M to 3.6 s at 10M+ because model warmup and process initialization are amortized over larger first SuperBatches. Full analysis in §5.10.

Figure 3. SURGE pipeline architecture. GPU encoding of SuperBatch j+1 overlaps with serialization and upload of SuperBatch j, eliminating I/O stalls; a mini-timeline shows how pipelining hides I/O latency. The caption notes that nvidia-smi utilization reflects kernel occupancy, not pipeline throughput (hence the observed ~10% utilization under SURGE).

Figure 4. Pipeline overlap. PBP issues many small GPU calls. (caption truncated at source)

Figure 5. Throughput comparison across methods, including the FB-100K fixed-batch baseline. (caption truncated at source)

Figure 8. Throughput sensitivity to the B_min threshold (B_max = 5×B_min). Throughput plateaus with diminishing returns; the operating point B_min = 100K achieves 26,027 texts/s with 89 flushes, 2.5 GB peak memory, and 3.6 s TTFO. Higher thresholds yield marginal throughput gains (500K: +8.3%) but increase TTFO (17.8 s) and memory (3.2 GB).

Figure 9. Serialization time (log scale): naive Python list serialization vs. alternatives. (caption truncated at source)

Figure 10. Scaling analysis from 1M to 50M texts. (a) Both methods achieve comparable throughput. (b) FB-100K memory grows with dataset size. (caption truncated at source)
Original abstract

We present SURGE, a streaming GPU encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. Production embedding pipelines face a tension between logical data partitioning and efficient GPU utilization: processing each partition independently incurs $P$ inter-process communication (IPC) calls whose overhead limits throughput for compute-light models. Our contributions are analytical: (i) a cost model (Theorem 1) predicting throughput within 2% across three encoders spanning a 15$\times$ parameter range; (ii) a memory-safety bound (Lemma 3) enabling a streaming two-threshold policy with peak memory $O(B_{\min} + n_{\max})$ rather than $O(N)$; and (iii) a $\phi$/CV decision framework characterizing when the pattern applies beyond our workload. The naive fix of batching at fixed size requires $O(N)$ peak memory (32.7 GB at 10M texts; infeasible beyond ~60M on 192 GB nodes), produces no output until all encoding completes, and offers no fault tolerance. SURGE achieves the same throughput with $O(B_{\min} + n_{\max})$ bounded memory (2.6 GB), 68$\times$ faster time-to-first-output, and crash recovery at SuperBatch granularity. On 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s -- matching fixed-batch throughput while using 12.6$\times$ less memory. We validate on bge-base (109M, $d$=768, error 1.3%) and across log-normal $\sigma$ in {1.0, 1.72, 2.5} (speedup invariant within $\pm$3%), and compare against a partition-batched baseline (PB-PBP-LB), against which SURGE retains a 7% throughput edge and 2.5$\times$ faster TTFO. Complementary engineering -- zero-copy Arrow serialization (22-25$\times$ speedup) and async I/O pipelining (up to 93% benefit) -- realizes the design but is not the contribution.
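The zero-copy Arrow serialization is credited in the abstract but not shown; a minimal sketch of what such a path could look like with pyarrow (function name and schema are illustrative, not the paper's code):

```python
import numpy as np
import pyarrow as pa

def write_superbatch(path: str, ids: list, emb: np.ndarray) -> None:
    """Serialize one SuperBatch of (n, d) float32 embeddings without
    per-row Python object conversion: the flattened numpy buffer is
    wrapped zero-copy as an Arrow array, then written via Arrow IPC."""
    flat = pa.array(emb.reshape(-1))                          # zero-copy view
    vectors = pa.FixedSizeListArray.from_arrays(flat, emb.shape[1])
    table = pa.table({"id": pa.array(ids), "embedding": vectors})
    with pa.ipc.new_file(path, table.schema) as writer:
        writer.write_table(table)
```

Avoiding a Python list-of-lists round trip is plausibly where a 22-25× serialization speedup of the kind the abstract reports would come from.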

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SURGE, a streaming GPU encoding system for heterogeneous partitioned data in production embedding pipelines. It contributes an analytical cost model (Theorem 1) claimed to predict throughput within 2% across three encoders, a memory-safety bound (Lemma 3) enabling a two-threshold streaming policy with peak memory O(B_min + n_max) instead of O(N), and a phi/CV framework for characterizing applicability. The system is reported to match fixed-batch throughput (26,413 texts/s on 10M texts with 4 L4 GPUs) while using 12.6x less memory (2.6 GB vs 32.7 GB), achieving 68x faster time-to-first-output and crash recovery, with validation across log-normal sigmas {1.0, 1.72, 2.5} and encoders including bge-base.

Significance. If the cost model and memory bound hold as stated, the work addresses a practical tension in large-scale embedding generation by enabling efficient GPU utilization without prohibitive memory or latency costs, while adding fault tolerance. The quantitative results across encoders, distributions, and baselines (including 7% throughput edge over partition-batched) suggest robustness, and the analytical framing (if substantiated with derivations) provides a reusable decision framework beyond the specific workload.

major comments (2)
  1. [Theorem 1] Theorem 1: The cost model is presented as predicting throughput within 2% across encoders and distributions, but the explicit derivation, model equations, parameter definitions, and full error analysis (including per-encoder and per-sigma breakdowns) are not provided in sufficient detail to verify that the 2% figure reflects genuine prediction rather than post-hoc agreement. This is load-bearing for the central analytical contribution.
  2. [Lemma 3] Lemma 3: The memory-safety bound O(B_min + n_max) is central to the streaming policy and the claimed reduction from 32.7 GB to 2.6 GB. The proof assumptions (e.g., on partition size heterogeneity and the two-threshold policy) and any dependence on total dataset size N require explicit expansion to confirm generality beyond the tested 10M-text log-normal cases.
minor comments (2)
  1. The abstract states validation 'within 2%' and 'within ±3%' but does not include a table or figure summarizing exact measured vs. predicted throughput values per encoder and sigma; adding such a summary would improve verifiability.
  2. The phi/CV framework is introduced to characterize applicability beyond tested distributions, but its precise definition and decision thresholds are not elaborated; a short formal statement or pseudocode would clarify its use (a speculative sketch follows this list).
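The paper's phi and CV definitions are not given in this excerpt, so the following is speculative: it assumes CV is the coefficient of variation of partition sizes, phi is the fraction of per-call time spent on fixed overhead rather than compute, and the thresholds are invented purely to illustrate the shape of such a decision rule:

```python
import statistics

def streaming_pattern_applies(partition_sizes, phi,
                              phi_min=0.1, cv_min=1.0):
    """Hypothetical phi/CV decision rule; definitions and thresholds are
    guesses, not the paper's.  High phi means small per-partition calls
    waste the GPU on overhead; high CV means a fixed per-partition batch
    size is badly matched to most partitions."""
    mean = statistics.mean(partition_sizes)
    cv = statistics.pstdev(partition_sizes) / mean
    return phi >= phi_min and cv >= cv_min
```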

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and valuable comments. We provide point-by-point responses to the major comments, clarifying the analytical contributions and committing to revisions where additional detail is needed to substantiate the claims.

Point-by-point responses
  1. Referee: [Theorem 1] Theorem 1: The cost model is presented as predicting throughput within 2% across encoders and distributions, but the explicit derivation, model equations, parameter definitions, and full error analysis (including per-encoder and per-sigma breakdowns) are not provided in sufficient detail to verify that the 2% figure reflects genuine prediction rather than post-hoc agreement. This is load-bearing for the central analytical contribution.

    Authors: We acknowledge that the presentation of Theorem 1 could benefit from more explicit detail. The cost model derivation is outlined in Section 4, with the 2% accuracy validated empirically across the three encoders and sigma values in Table 3 and Figure 5. To enable full verification, we will include the complete step-by-step derivation, all model equations, parameter definitions, and a detailed error analysis with per-encoder and per-sigma breakdowns in an expanded appendix in the revised manuscript. revision: yes

  2. Referee: [Lemma 3] Lemma 3: The memory-safety bound O(B_min + n_max) is central to the streaming policy and the claimed reduction from 32.7 GB to 2.6 GB. The proof assumptions (e.g., on partition size heterogeneity and the two-threshold policy) and any dependence on total dataset size N require explicit expansion to confirm generality beyond the tested 10M-text log-normal cases.

    Authors: We agree that the assumptions underlying Lemma 3 should be stated more explicitly. The bound holds under the two-threshold policy where batches are formed to respect memory limits, with n_max being the largest partition size and B_min the minimum batch size, and it is independent of total N due to the streaming nature. We will expand the proof in the appendix to detail the assumptions on heterogeneity (log-normal distributions with the tested sigmas) and the policy, including a discussion of generality to other distributions where partition sizes are bounded. revision: yes
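In symbols, reconstructing from the rebuttal's wording rather than quoting Lemma 3: the buffer grows only by whole partitions of at most $n_{\max}$ texts, and a flush fires as soon as it reaches $B_{\min}$ texts, so immediately before any append the buffer holds fewer than $B_{\min}$ texts, and immediately after the append that triggers a flush it holds fewer than $B_{\min} + n_{\max}$. Peak buffered data is therefore bounded by $B_{\min} + n_{\max}$, with no dependence on the total dataset size $N$, provided every partition fits in memory ($n_{\max}$ bounded), which is the assumption the authors agree to make explicit.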

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper's central contributions are an analytical cost model (Theorem 1) that predicts throughput within 2% across encoders and a memory-safety bound (Lemma 3) enabling O(B_min + n_max) streaming, both presented as independent derivations validated on external 10M-text workloads with log-normal distributions and multiple baselines. These do not reduce to fitted parameters by construction, self-citations, or ansatzes imported from prior author work; the phi/CV framework further characterizes applicability without circularity. No load-bearing step matches any enumerated pattern, and the claims rest on mathematical bounds plus empirical matching to fixed-batch throughput.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claim rests on the validity of the analytical cost model and memory bound as predictive tools for the streaming policy, plus workload assumptions about partition size variability; no new physical entities postulated.

free parameters (2)
  • B_min
    Minimum batch size threshold in the two-threshold streaming policy, chosen to balance throughput and memory.
  • n_max
    Maximum partition size used in the O(B_min + n_max) memory bound.
axioms (3)
  • domain assumption: Throughput can be modeled analytically by Theorem 1 with accuracy within 2% across encoders spanning a 15× parameter range.
    Invoked as the basis for the cost model contribution and throughput predictions.
  • domain assumption: The memory-safety bound of Lemma 3 enables a streaming policy with peak memory O(B_min + n_max) rather than O(N).
    Directly supports the bounded-memory claim and the comparison to the naive fixed-batch approach.
  • domain assumption: The phi/CV decision framework characterizes when the streaming pattern applies beyond the tested workload.
    Used to generalize the approach to other partition-size distributions.

pith-pipeline@v0.9.0 · 5728 in / 1836 out tokens · 45691 ms · 2026-05-09T18:11:44.939702+00:00 · methodology

