
arxiv: 2606.09643 · v1 · pith:GN75ZA6Qnew · submitted 2026-06-08 · 💻 cs.DC · cs.AI · cs.LG · cs.OS

FMplex: Model Virtualization for Serving Extensible Foundation Models

Pith reviewed 2026-06-27 14:48 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG · cs.OS
keywords: model serving, foundation models, virtualization, model sharing, task customization, batch scheduling, latency reduction, resource efficiency

The pith

FMplex virtualizes foundation model backbones so customized tasks can share one instance while keeping their own extensions and isolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to treat a foundation model as a shared virtualization substrate instead of replicating the full backbone for every downstream task. Each task receives a virtual foundation model backed by the same physical one, which keeps task-specific changes, separate lifecycles, and isolation intact. A batch-aware fair-queueing scheduler then mixes weighted sharing with inter- and intra-task batching to improve efficiency. Experiments across seven backbones and ninety-two tasks report large gains in latency and task density over both spatial partitioning and simple co-location. A reader would care because foundation models are expensive to run and current serving approaches waste memory and compute by duplicating them.

Core claim

FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. A batch-aware fair-queueing scheduler combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.
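
The vFM idea can be pictured with a toy sketch. Everything here is hypothetical and written only for illustration: `PhysicalFM`, `VirtualFM`, `Server`, and the doubling "backbone" are stand-ins, not FMplex's API. The sketch shows one backbone object shared by several per-task views, each with its own extension and its own attach/detach lifecycle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


class PhysicalFM:
    """Toy stand-in for a shared backbone (hypothetical, not FMplex code)."""

    def __init__(self, name: str) -> None:
        self.name = name  # a real system would load weights here, once

    def embed(self, x: Vector) -> Vector:
        # Placeholder "backbone": the same transformation for every task.
        return [2.0 * v for v in x]


@dataclass
class VirtualFM:
    """Per-task view: a private extension on top of a shared backbone."""
    task_id: str
    backbone: PhysicalFM
    head: Callable[[Vector], float]

    def infer(self, x: Vector) -> float:
        return self.head(self.backbone.embed(x))


class Server:
    """Hosts one physical backbone and any number of attached vFMs."""

    def __init__(self, backbone_name: str) -> None:
        self.backbone = PhysicalFM(backbone_name)
        self.vfms: Dict[str, VirtualFM] = {}

    def attach(self, task_id: str, head: Callable[[Vector], float]) -> VirtualFM:
        vfm = VirtualFM(task_id, self.backbone, head)
        self.vfms[task_id] = vfm
        return vfm

    def detach(self, task_id: str) -> None:
        # Removing a task drops only its extension; the backbone stays loaded.
        del self.vfms[task_id]
```

In this sketch, attaching two tasks with different heads (for example `sum` and `max`) yields two vFMs whose `backbone` attribute is the same object, and detaching one leaves the other untouched. Real isolation and scheduling are outside its scope.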

What carries the argument

The virtual foundation model (vFM) abstraction backed by a shared physical FM, together with the batch-aware fair-queueing scheduler that mixes weighted sharing and batching.
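
To make "weighted sharing plus batching" concrete, here is a minimal sketch of a generic virtual-time weighted policy that fills a batch across per-task queues. It is not the paper's scheduler: the class name `WeightedBatchScheduler`, the integer weights, and the tie-breaking rule are assumptions chosen for the illustration.

```python
from collections import deque
from fractions import Fraction
from typing import Deque, Dict, List, Tuple


class WeightedBatchScheduler:
    """Fill each batch by repeatedly serving the task with the least virtual time."""

    def __init__(self, weights: Dict[str, int], max_batch: int) -> None:
        self.weights = weights
        self.max_batch = max_batch
        self.queues: Dict[str, Deque[str]] = {t: deque() for t in weights}
        # Virtual time advances by 1/weight per request served, so a task
        # with weight 3 is charged a third as much as a task with weight 1.
        self.vtime: Dict[str, Fraction] = {t: Fraction(0) for t in weights}

    def submit(self, task: str, request: str) -> None:
        self.queues[task].append(request)

    def next_batch(self) -> List[Tuple[str, str]]:
        batch: List[Tuple[str, str]] = []
        while len(batch) < self.max_batch:
            ready = [t for t, q in self.queues.items() if q]
            if not ready:
                break  # work-conserving: stop only when all queues are empty
            # Ties are broken by task name to keep the order deterministic.
            task = min(ready, key=lambda t: (self.vtime[t], t))
            batch.append((task, self.queues[task].popleft()))
            self.vtime[task] += Fraction(1, self.weights[task])
        return batch
```

With weights 3:1 and both queues backlogged, a batch of four holds three requests from the heavier task and one from the lighter one; if only one task has work, it gets the whole batch. A production policy would also need to handle a task that returns from a long idle period, which this sketch does not.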

If this is right

  • Tasks can start, stop, or update independently without reloading or duplicating the shared backbone.
  • Batching and loading costs are amortized across many tasks instead of being paid per instance.
  • Accelerator memory holds many more active tasks because only one copy of the heavyweight backbone is needed.
  • Cluster operators can increase served task count without adding proportional hardware.
  • Task isolation remains at the individual-task level even though the backbone is shared.
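
The memory argument behind these points is simple arithmetic, sketched below under stated assumptions. The sizes are hypothetical round numbers rather than measurements from the paper, and the model ignores activation memory, batching buffers, and fragmentation.

```python
def max_tasks_replicated(budget_mib: int, backbone_mib: int, extension_mib: int) -> int:
    """Each task carries its own backbone copy plus its extension."""
    return budget_mib // (backbone_mib + extension_mib)


def max_tasks_shared(budget_mib: int, backbone_mib: int, extension_mib: int) -> int:
    """One backbone copy is paid once; each task adds only its extension."""
    if budget_mib < backbone_mib:
        return 0
    return (budget_mib - backbone_mib) // extension_mib


# Hypothetical sizes: 16 GiB accelerator, 1.5 GiB backbone, 256 MiB per extension.
BUDGET, BACKBONE, EXTENSION = 16 * 1024, 1536, 256
replicated = max_tasks_replicated(BUDGET, BACKBONE, EXTENSION)
shared = max_tasks_shared(BUDGET, BACKBONE, EXTENSION)
```

Under these assumed sizes the replicated layout fits 9 tasks and the shared layout fits 58. The ratio depends entirely on how small extensions are relative to the backbone, which is why density gains of this kind show up when memory, not compute, is the binding constraint.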

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sharing approach could let operators add or remove tasks dynamically without restarting the shared model.
  • Energy use per task could fall because fewer full model copies run in parallel.
  • Task developers might begin designing extensions to take advantage of sharing rather than assuming a private full model.
  • The pattern might apply to other large shared components such as embedding tables or feature extractors in production pipelines.

Load-bearing premise

The added virtualization layer and scheduler can deliver the reported latency and density gains without new interference or overhead that would cancel the benefits of sharing.

What would settle it

A measurement showing that average per-task latency or total tasks per accelerator drops to the level of spatial partitioning once task extensions or fairness constraints are enforced at scale.

Figures

Figures reproduced from arXiv: 2606.09643 by David Irwin, Hetvi Shastri, Mani Srivastava, Pragya Sharma, Prashant Shenoy, Walid A. Hanafy.

Figure 1
Figure 1: Benefits of FM sharing in terms of memory demand and throughput across a number of tasks and modalities. … model will typically use a task-specific head (e.g., a classifier head) and can further fine-tune the model using parameter-efficient fine-tuning approaches [39]. Despite the multi-task nature of foundation models, conventional model-serving systems, such as NVIDIA Triton [45], are still built around ta…
Figure 2
Figure 2: Comparing (a) the instance-per-task approach, where each task loads its own FM and the backbone is replicated, with (b) our FM virtualization approach, where each task is presented with a virtual FM (vFM) backed by a shared physical FM, enabling deployment sharing. … FM, their inference requests execute on the same model instance and may be batched together, increasing the risk of cross-task interference. …
Figure 3
Figure 3: Architecture of an FM-based task pipeline featuring a decoder head, optional encoder, and fine-tuning adapter. … schedulers achieve similar fairness at only 37 RPS. At cluster scale, FMplex hosts up to 6× more tasks than current co-location approaches at low load, where memory is the binding constraint, and 8–12% more at moderate and high load, where compute is the binding constraint. 2 Background. This sect…
Figure 4
Figure 4 depicts the overview of FMplex. At a high level, FMplex decouples each task’s logical view of the foundation model from its physical substrate. Analogous to a hypervisor [70] or containers [37, 40], FMplex presents each task with a virtual foundation model (vFM) and multiplexes many vFMs over a single shared physical FM. FMplex comprises three components that jointly realize R1–R4. The vFM abstractio…
Figure 5
Figure 5: BFQ behavior under different scenarios.
Figure 6
Figure 6: End-to-end serving stack on top of FMplex. … and FMplex-Controller to support task deployment, routing, and adaptation across a cluster. 5.1 Overview. The mechanisms in Section 4 define how a single server virtualizes shared FM execution through vFMs, task-local queues, and BFQ
Figure 7
Figure 7: Comparing FMplex when serving two tasks using MOMENT-Large ECG and gesture classification tasks across scheduling approaches. [Plot residue from adjacent panels: mean latency (ms) versus RPS per task for ST, SP, BE, and FMplex; panels (a) DINOv2-Base and (b) Swin-Large.]
Figure 8
Figure 8: Comparing FMplex when serving two tasks across models. 7.2.1 Benefits of FM-sharing on Performance. We first demonstrate the latency benefits of FM sharing relative to the deployment baselines BE and SP, and quantify the sharing overhead against the per-task latency under no sharing (i.e., ST)
Figure 11
Figure 11: Impact of Customization. … cumulative cost of 10 backbone replicas exceeds the 16 GB VRAM budget. Similarly, at 7 RPS per task, FMplex’s mean latency grows sublinearly from 33 ms at N = 2 to 148 ms at N = 10, while achieving 79% lower latency than BE at N = 8, the maximum it can run as it reaches the memory limit. The same scaling behavior holds across modalities (…
Figure 13
Figure 13: Noisy neighbor experiment with weight (3:1) showing throughput and fairness. Client A, the high-priority client, starts at 5 RPS, spikes to 500 RPS, and then returns to 5 RPS, a pattern common in serverless and event-driven systems [60]. We compare FMplex against BE, SP, S-BE, and S-STFQ. Figure 13a shows how each method responds over time to Client A’s burst. We omit BE and S-BE for clarity. SP limits Cl…
Figure 15
Figure 15: Number of tasks the cluster can host across approaches and load profiles (low, moderate, high). 7.4.1 Cluster-Scale Latency
Figure 16
Figure 16: Adaptation latency after a workload surge in FMplex and BE. [Plot residue: service time (ms) for MOMENT-Large, PaPaGei, DINOv2-Base, and Swin-Large under ST and FMplex; bar labels read 22.4, 8.9, 18.7, 30.6, 23.2, 9.0, 19.0, 30.8, with the series assignment not recoverable from the extraction.]
Figure 17
Figure 17: FMplex scheduling overhead. … both replicas. This path completes in 500 ms and produces only a transient increase in latency. In BE, there is no backbone sharing, so the system must start a new MOMENT-Large instance before it can shift load. This start-backbone path waits until the new backbone is ready, around 58 s after the workload change. During this interval, mean latency rises by roughly two orders…
Figure 18
Figure 18: CDF across request rates for MOMENT-Large (…
Figure 19
Figure 19: CDF across request rates for DINOv2-Base (Figure 8a)
Figure 20
Figure 20: CDF across request rates for Swin-Large (Figure 8b)
Original abstract

Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement a FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FMplex, a serving system for foundation models that introduces virtual foundation models (vFM) as a virtualization abstraction allowing multiple independently customized downstream tasks to share a physical FM backbone while preserving task-specific extensions, independent lifecycles, and isolation. It also proposes a batch-aware fair-queueing scheduler enabling inter- and intra-task batching. The system is implemented as a full serving stack, and evaluation across 7 FM backbones (16 variants) and 92 downstream tasks reports latency reductions of up to 80% versus spatial partitioning and 33.3% versus best-effort co-location, plus the ability to host up to 6x more tasks at cluster scale.

Significance. If the empirical results hold after addressing measurement gaps, the work would be significant for distributed ML serving: it directly targets memory waste and batching under-utilization when deploying many task-specific FM variants, offering a practical path to higher density without sacrificing per-task customization. The virtualization substrate idea and combined scheduler are novel contributions in the model-serving literature.

major comments (2)
  1. [Evaluation section (results on 7 backbones / 92 tasks)] The strongest claims (80% latency reduction, 33.3% improvement over co-location, 6x task density) are load-bearing on the assertion that vFM virtualization and the batch-aware scheduler introduce negligible interference or overhead. The manuscript provides no dedicated overhead breakdown (e.g., context-switch cost, memory-mapping overhead, or batching-efficiency loss due to isolation) in the evaluation; without such quantification relative to the reported gains, it is impossible to confirm the net benefit.
  2. [Scheduler design and runtime execution sections] The scheduler description claims weighted task-level sharing combined with inter/intra-task batching, but the manuscript does not show how the fair-queueing policy interacts with task-specific extensions or isolation enforcement; if isolation prevents full batch merging, the latency and density claims would be undermined. A concrete example or micro-benchmark isolating this interaction is needed.
minor comments (2)
  1. [Abstract] The abstract states performance numbers without error bars, number of runs, or exact workload characteristics; adding these in the evaluation tables would improve clarity.
  2. [Introduction / System overview] Notation for vFM and the physical FM mapping could be formalized earlier (e.g., with a small diagram or equations) to aid readers unfamiliar with virtualization concepts in ML serving.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of FMplex. We address each major comment below and will revise the manuscript accordingly to strengthen the evaluation.

Point-by-point responses
  1. Referee: [Evaluation section (results on 7 backbones / 92 tasks)] The strongest claims (80% latency reduction, 33.3% improvement over co-location, 6x task density) are load-bearing on the assertion that vFM virtualization and the batch-aware scheduler introduce negligible interference or overhead. The manuscript provides no dedicated overhead breakdown (e.g., context-switch cost, memory-mapping overhead, or batching-efficiency loss due to isolation) in the evaluation; without such quantification relative to the reported gains, it is impossible to confirm the net benefit.

    Authors: We agree that a dedicated overhead breakdown is needed to fully substantiate the negligible-interference claim. In the revised manuscript we will add micro-benchmarks that quantify context-switch cost, memory-mapping overhead, and any batching-efficiency loss attributable to isolation, presented relative to the end-to-end gains already reported. revision: yes

  2. Referee: [Scheduler design and runtime execution sections] The scheduler description claims weighted task-level sharing combined with inter/intra-task batching, but the manuscript does not show how the fair-queueing policy interacts with task-specific extensions or isolation enforcement; if isolation prevents full batch merging, the latency and density claims would be undermined. A concrete example or micro-benchmark isolating this interaction is needed.

    Authors: The scheduler batches requests at the shared physical backbone before task-specific extensions are applied, allowing inter-task batching while isolation is maintained via separate extension layers. We will add both a concrete scheduling example and an isolating micro-benchmark to the revised scheduler section to demonstrate this interaction explicitly. revision: yes
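
The data path described in this response can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `backbone_forward`, `serve_mixed_batch`, and the toy heads are hypothetical, and the sketch covers only extensions that run after the backbone (heads), not adapters that modify backbone internals.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def backbone_forward(batch: List[Vector]) -> List[Vector]:
    """Toy shared backbone: one call processes inputs from any mix of tasks."""
    return [[2.0 * v for v in x] for x in batch]


def serve_mixed_batch(
    requests: List[Tuple[str, Vector]],
    heads: Dict[str, Callable[[Vector], float]],
) -> List[Tuple[str, float]]:
    """Run one backbone pass over a cross-task batch, then apply each task's head."""
    inputs = [x for _, x in requests]
    features = backbone_forward(inputs)  # single shared pass for all tasks
    # Each feature row goes only to the head of the task that submitted it.
    return [(task, heads[task](f)) for (task, _), f in zip(requests, features)]


example_heads: Dict[str, Callable[[Vector], float]] = {"ecg": sum, "gesture": max}
example_requests = [("ecg", [1.0, 2.0]), ("gesture", [3.0, 1.0]), ("ecg", [0.0, 5.0])]
example_results = serve_mixed_batch(example_requests, example_heads)
```

Because each output row is routed back by task identifier, a task never sees another task's features in this sketch. Whether the same separation holds for extensions that alter the backbone's intermediate computation is exactly the interaction the referee asks the authors to demonstrate.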

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with independent benchmarks

Full rationale

The paper describes an implemented serving system (FMplex) with vFM virtualization and a batch-aware fair-queueing scheduler, then reports measured latency reductions (up to 80% vs spatial partitioning, 33.3% vs best-effort co-location) and task density gains (up to 6x) from running 92 downstream tasks on 7 FM backbones. These outcomes are presented as direct results of the prototype evaluation rather than any derivation, fitted parameter, or self-citation chain that reduces the numbers to the inputs by construction. No equations, uniqueness theorems, ansatzes, or renamings appear in the provided text; the central claims rest on external benchmark data and are therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available; therefore the ledger is limited to the core abstractions explicitly named.

axioms (1)
  • domain assumption: Foundation model backbones can be extended for downstream tasks while the core weights remain shareable without task interference.
    Required for the vFM abstraction to deliver both sharing and task-specific extensions.
invented entities (1)
  • virtual foundation model (vFM): no independent evidence
    purpose: Logically private FM instance backed by a shared physical FM
    New abstraction introduced to enable sharing while preserving extensions and isolation.



Reference graph

Works this paper leans on

86 extracted references · 17 canonical work pages

  1. [1]

    Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitaraman, and Thomas Woo. 2024. Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (La Jolla, CA, USA) (ASPLOS ’24). 318–...

  2. [2]

    Amazon Web Services. 2026. Amazon Bedrock. https://aws.amazon.com/bedrock/. Accessed: 2026-05-14

  3. [3]

    Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Syndar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. 2024. Chronos: Learning the Language of...

  4. [4]

    Joshua Bakita and James H Anderson. 2023. Hardware Compute Partitioning on NVIDIA GPUs. In Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium. 54–66

  5. [5]

    Charith Chandra Sai Balne, Sreyoshi Bhaduri, Tamoghna Roy, Vinija Jain, and Aman Chadha. 2024. Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications. arXiv:2404.13506 [cs.LG] https://arxiv.org/abs/2404.13506

  6. [6]

    Ozan Baris, Yizhuo Chen, Gaofeng Dong, Liying Han, Tomoyoshi Kimura, Pengrui Quan, Ruijie Wang, Tianchen Wang, Tarek Abdelzaher, Mario Bergés, Paul Pu Liang, and Mani Srivastava. 2025. Foundation Models for CPS-IoT: Opportunities and Challenges. arXiv:2501.16368 [cs.LG] https://arxiv.org/abs/2501.16368

  7. [7]

    Rishi Bommasani et al. 2021. On the Opportunities and Risks of Foundation Models. ArXiv (2021). https://crfm.stanford.edu/assets/report.pdf

  8. [8]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  9. [9]

    Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, and Ion Stoica. 2025. Locality-aware Fair Scheduling in LLM Serving. arXiv:2501.14312 [cs.DC] https://arxiv.org/abs/2501.14312

  10. [10]

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2024. Punica: Multi-Tenant LoRA Serving. In Proceedings of Machine Learning and Systems (MLSys)

  11. [11]

    Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 613–627. https://www.usenix.org/conference/nsdi17/technical-sessions/presentatio...

  12. [12]

    R. J. Creasy. 1981. The Origin of the VM/370 Time-Sharing System. IBM Journal of Research and Development 25, 5 (1981), 483–490. doi:10.1147/rd.255.0483

  13. [13]

    A. Demers, S. Keshav, and S. Shenker. 1989. Analysis and Simulation of a Fair Queueing Algorithm. SIGCOMM Comput. Commun. Rev. 19, 4 (aug 1989), 1–12. doi:10.1145/75247.75248

  14. [14]

    Haoyu Dong, Hanxue Gu, Yaqian Chen, Jichen Yang, Yuwen Chen, and Maciej A. Mazurowski. 2024. Segment anything model 2: an application to 2D and 3D medical images. arXiv:2408.00756 [cs.CV] https://arxiv.org/abs/2408.00756

  15. [15]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR). ht...

  16. [16]

    R. Elliott. 2002. A measure of fairness of service for scheduling algorithms in multiuser systems. In IEEE CCECE2002. Canadian Conference on Electrical and Computer Engineering. Conference Proceedings (Cat. No.02CH37373), Vol. 3. 1583–1588 vol.3. doi:10.1109/CCECE.2002.1012991

  17. [17]

    Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, and Ievgen Redko. 2025. Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification. arXiv preprint arXiv:2502.15637 (2025)

  18. [18]

    Théo Gnassounou, Yessin Moakher, Shifeng Xie, Vasilii Feofanov, and Ievgen Redko. 2025. Leveraging Generic Time Series Foundation Models for EEG Classification. arXiv:2510.27522 [cs.LG] https://arxiv.org/abs/2510.27522

  19. [19]

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. MOMENT: A Family of Open Time-series Foundation Models. In International Conference on Machine Learning

  20. [20]

    Pawan Goyal, Harrick M. Vin, and Haichen Cheng. 1997. Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks. IEEE/ACM Trans. Netw. 5, 5 (oct 1997), 690–704. doi:10.1109/90.649569

  21. [21]

    gRPC Authors. 2026. gRPC: A high performance open-source universal RPC framework. https://grpc.io/. Accessed: 2026-03-26

  22. [22]

    Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 443–462. https://www.usenix.org/conference/osdi20/presentation/gujarati

  23. [23]

    Daya Guo, Dejian Yang, et al. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through Reinforcement Learning. Nature 645, 8081 (Sept. 2025), 633–638. doi:10.1038/s41586-025-09422-z

  24. [24]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  25. [25]

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Sal...

  26. [26]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 1, 2 (2022), 3

  27. [27]

    Nan Huang, Haishuai Wang, Zihuai He, Marinka Zitnik, and Xiang Zhang. 2025. Repurposing Foundation Model for Generalizable Medical Time Series Classification. arXiv:2410.03794 [cs.LG] https://arxiv.org/abs/2410.03794

  28. [28]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

  29. [29]

    Anuj Kumar, Harish Kumar Saravanan, Shivam Dwivedi, and Pandarasamy Arjunan. 2025. MixForecast: Mixer-Enhanced Foundation Model for Load Forecasting. In Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (Irvine, CA, USA) (FMSys). 25–30. doi:10.1145/3722565.3727193

  30–31. [30–31]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  32. [32]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663–679

  33. [33]

    Qianlin Liang, Walid A. Hanafy, Ahmed Ali-Eldin, and Prashant Shenoy. 2023. Model-Driven Cluster Resource Management for AI Workloads in Edge Clouds. ACM Transactions on Autonomous and Adaptive Systems 18, 1, Article 2 (mar 2023), 26 pages. doi:10.1145/3582080

  34. [34]

    Qianlin Liang, Walid A. Hanafy, Noman Bashir, David Irwin, and Prashant Shenoy. 2023. Energy Time Fairness: Balancing Fair Allocation of Energy and Time for GPU Workloads. In 2023 IEEE/ACM Symposium on Edge Computing (SEC). 53–66. doi:10.1145/3583740.3628435

  35. [35]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Article 1516, 25 pages

  36. [36]

    Shikun Liu, Edward Johns, and Andrew J. Davison. 2019. End-To-End Multi-Task Learning With Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  37. [37]

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv:2103.14030 [cs.CV] https://arxiv.org/abs/2103.14030

  38. [38]

    LXC Contributors. 2026. LXC: Linux Containers. https://linuxcontainers.org/. Accessed: 2026-04-14

  39. [39]

    Diptyaroop Maji, Kang Yang, Prashant Shenoy, Ramesh K Sitaraman, and Mani Srivastava. 2025. CarbonX: An Open-Source Tool for Computational Decarbonization Using Time Series Foundation Models. arXiv:2510.01521 [cs.LG] https://arxiv.org/abs/2510.01521

  40. [40]

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, Benjamin Bossan, and Marian Tietz. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft

  41. [41]

    Dirk Merkel. 2014. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal 2014, 239 (2014)

  42. [42]

    Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295 (2024)

  43. [43]

    Meta. 2024. Llama 3.2 Vision models. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/. Accessed: 2026-03-05

  44. [44]

    Nathan Ng, Abel Souza, Ahmed Ali-Eldin, David Irwin, Don Towsley, and Prashant Shenoy. 2024. TailClipper: Reducing Tail Response Time of Distributed Services Through System-Wide Scheduling. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC ’24). 398–414. doi:10.1145/3698038.3698554

  45. [45]

    David Nigenda, Zohar Karnin, Muhammad Bilal Zafar, Raghu Ramesha, Alan Tan, Michele Donini, and Krishnaram Kenthapadi. 2022. Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Associati...

  46. [46]

    NVIDIA. 2024. Triton Inference Server. https://developer.nvidia.com/triton-inference-server. Accessed: 2025-04-13

  47. [47]

    NVIDIA Corporation. 2026. CUDA Driver API: Green Contexts. NVIDIA Corporation. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html. Accessed: 2026-03-27

  48. [48]

    NVIDIA Corporation. 2026. NVIDIA Multi-Instance GPU (MIG). https://www.nvidia.com/en-us/technologies/multi-instance-gpu/. Accessed: 2026-03-27

  49. [49]

    NVIDIA Corporation. 2026. NVIDIA Multi-Process Service. NVIDIA Corporation. https://docs.nvidia.com/deploy/mps/index.html

  50. [50]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Fran- cisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wo- jciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick...

  51. [51]

    Parekh and R.G

    A.K. Parekh and R.G. Gallager. 1993. A generalized processor sharing approach to flow control in integrated services networks: the single- node case.IEEE/ACM Transactions on Networking1, 3 (1993), 344–357. doi:10.1109/90.234856

  52. [52]

    2019.PyTorch: an imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019.PyTorch: an imperative style, high-p...

  53. [53]

    Arvind Pillai, Dimitris Spathis, Fahim Kawsar, and Mohammad Malekzadeh. 2025. PaPaGei: Open Foundation Models for Optical Physiological Signals. arXiv:2410.20542 [cs.LG]https://arxiv.org/abs/ 2410.20542

  54. [54]

    Popek and Robert P

    Gerald J. Popek and Robert P. Goldberg. 1974. Formal Requirements for Virtualizable Third Generation Architectures.Commun. ACM17, 7 (1974), 412–421. doi:10.1145/361011.361073

  55. [55]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learn- ing Transferable Visual Models From Natural Language Supervision. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machi...

  56. [56]

    Varun Rao, Youran Sun, Mahendra Kumar, Tejas Mutneja, Agastya Mukherjee, and Haizhao Yang. 2025. LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard. arXiv:2504.13125 [cs.CL] https://arxiv.org/abs/2504.13125

  57. [57]

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 779–788

  59. [59]

    Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 397–411. https://www.usenix.org/conference/atc21/presentation/romero

  60. [60]

    Hasim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH. 338–342

  61. [61]

    Mahadev Satyanarayanan, Nathan Beckmann, Grace A. Lewis, and Brandon Lucia. 2021. The Role of Edge Offload for Hardware-Accelerated Mobile Devices. In Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications (Virtual, United Kingdom) (HotMobile ’21). 22–29. doi:10.1145/3446382.3448360

  62. [62]

    Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. 2020. Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 205...

  63. [63]

    Ao Shen, Zhiyao Li, and Mingyu Gao. 2024. FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving. arXiv:2411.18424 [cs.LG] https://arxiv.org/abs/2411.18424

  64. [64]

    Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP ’19). Association for Computing Machinery, New ...

  65. [65]

    Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, and Ang Li. 2025. EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices. Association for Computing Machinery, New York, NY, USA, 138–153. https://doi.org/10.1145/3711875.3729141

  66. [66]

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. 2024. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285 [cs.LG] https://arxiv.org/abs/2311.03285

  67. [67]

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. 2024. Fairness in Serving Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 965–988. https://www.usenix.org/conference/osdi24/presentation/sheng

  68. [68]

    Yu Shi, Zongliang Fu, Shuo Chen, Bohan Zhao, Wei Xu, Changshui Zhang, and Jian Li. 2025. Kronos: A Foundation Model for the Language of Financial Markets. arXiv:2508.02739 [q-fin.ST] https://arxiv.org/abs/2508.02739

  69. [69]

    Shakhrul Iman Siam, Hyunho Ahn, Li Liu, Samiul Alam, Hui Shen, Zhichao Cao, Ness Shroff, Bhaskar Krishnamachari, Mani Srivastava, and Mi Zhang. 2025. Artificial Intelligence of Things: A Survey. ACM Trans. Sen. Netw. 21, 1, Article 9 (Jan. 2025), 75 pages. doi:10.1145/3690639

  70. [70]

    Luigi Simeone. 2026. Time Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark. arXiv:2602.10848 [cs.LG] https://arxiv.org/abs/2602.10848

  71. [71]

    Michael Sindelar, Ramesh K. Sitaraman, and Prashant Shenoy. 2011. Sharing-aware algorithms for virtual machine colocation (SPAA ’11). Association for Computing Machinery, New York, NY, USA, 367–378. doi:10.1145/1989493.1989554

  72. [72]

    J.E. Smith and Ravi Nair. 2005. The architecture of Virtual Machines. Computer 38, 5 (2005), 32–38. doi:10.1109/MC.2005.173

  73. [73]

    Trevor Standley, Amir Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. 2020. Which tasks should be learned together in multi-task learning? In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 846, 13 pages

  74. [74]

    Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97). PMLR, 6105–6114. https://proceedings.mlr.press/v97/tan19a.html

  75. [75]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] https://arxiv.org/abs...

  76. [76]

    Carl A Waldspurger and William E Weihl. 1994. Lottery scheduling: Flexible Proportional-share Resource Management. In Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation. 1–es

  77. [77]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://a...

  78. [78]

    Timothy Wood, Gabriel Tarasuk-Levin, Prashant Shenoy, Peter Desnoyers, Emmanuel Cecchet, and Mark D. Corner. 2009. Memory buddies: exploiting page sharing for smart colocation in virtualized data centers. SIGOPS Oper. Syst. Rev. 43, 3 (July 2009), 27–36. doi:10.1145/1618525.1618529

  79. [79]

    Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)

  80. [80]

    Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. arXiv:2312.12148 [cs.CL] https://arxiv.org/abs/2312.12148

Showing first 80 references.