pith. machine review for the scientific record.

arxiv: 2604.26881 · v1 · submitted 2026-04-29 · 💻 cs.DC · cs.LG

Recognition: unknown

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving


Pith reviewed 2026-05-07 10:28 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords: Mixture-of-Experts · Serverless computing · Function-as-a-Service · Multi-tenant model serving · Inference optimization · Resource efficiency

The pith

FaaSMoE runs Mixture-of-Experts models by invoking only the active experts as stateless serverless functions, using under one third the resources of a full-model baseline in multi-tenant settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FaaSMoE as a way to serve large Mixture-of-Experts models without keeping every expert resident in memory at all times. It splits the model so that experts become independent stateless functions that a FaaS platform can start and stop on demand. In multi-tenant workloads this lets unused experts release their resources immediately for other tenants or requests. The design also lets operators group several experts inside one function to reduce the number of separate calls. Evaluation on Qwen1.5-moe-2.7B under concurrent tenant traffic shows total resource consumption falling below one third of the amount needed when the entire model stays loaded.
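To make the decomposition concrete, here is a minimal control-plane sketch under assumed conventions: the gate scores all experts locally, but only the selected ones are invoked as functions behind a hypothetical gateway URL. TOP_K, the endpoint naming, and the gateway address are illustrative assumptions, not the paper's implementation.

# Minimal control-plane sketch, not the paper's code: the gate stays in the control
# plane, and only the experts it selects for a token are invoked as stateless
# functions, so inactive experts can scale to zero.
import json
import urllib.request

import numpy as np

FAAS_BASE_URL = "http://faas-gateway.local"  # hypothetical gateway for expert functions
TOP_K = 4                                    # experts activated per token (illustrative)


def invoke_expert(expert_id: int, hidden_state: np.ndarray) -> np.ndarray:
    """Call one expert function; the platform cold-starts it if it is scaled to zero."""
    payload = json.dumps({"expert_id": expert_id,
                          "activation": hidden_state.tolist()}).encode()
    req = urllib.request.Request(f"{FAAS_BASE_URL}/expert-{expert_id}", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return np.asarray(json.loads(resp.read())["output"])


def moe_layer(hidden_state: np.ndarray, gate_weights: np.ndarray) -> np.ndarray:
    """Route one token: score all experts, invoke only the top-k, mix their outputs."""
    logits = gate_weights @ hidden_state                 # (num_experts,)
    top = np.argsort(logits)[-TOP_K:]                    # indices of activated experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                             # softmax over the active set
    outputs = [invoke_expert(int(e), hidden_state) for e in top]
    return sum(w * o for w, o in zip(gate, outputs))

The point of the sketch is the separation: the gate and the per-expert weights never live in the same process, so resident memory scales with what a request actually touches rather than with the full model.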

Core claim

FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. It further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. A prototype built on an open-source edge-oriented FaaS platform and tested with Qwen1.5-moe-2.7B under multi-tenant workloads uses less than one third of the resources required by a full-model baseline.

What carries the argument

Stateless FaaS functions that host individual experts or small groups of experts, invoked on demand with scale-to-zero behavior and configurable grouping to control invocation frequency.
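A matching sketch of what one stateless expert function might look like, again under assumed conventions: the handler signature, the weight shapes, and the plain ReLU feed-forward block are placeholders, not the prototype's code.

# Sketch of a single stateless expert function (hypothetical handler signature; the
# paper's edge-oriented FaaS platform will have its own entry-point convention).
# Weights are pulled on the cold-start path and cached for the container's lifetime;
# an expert that receives no traffic holds no memory once scaled to zero.
import numpy as np

_WEIGHTS = None  # filled on first invocation (cold start), reused on warm invocations


def _load_weights(expert_id: int) -> dict:
    """Stand-in for fetching this expert's FFN weights from shared storage."""
    rng = np.random.default_rng(expert_id)     # placeholder weights, illustrative sizes
    d_model, d_ff = 2048, 1408
    return {"w_in": 0.02 * rng.standard_normal((d_ff, d_model)),
            "w_out": 0.02 * rng.standard_normal((d_model, d_ff))}


def handler(event: dict) -> dict:
    """FaaS entry point: apply this expert's feed-forward block to one activation."""
    global _WEIGHTS
    if _WEIGHTS is None:                       # cold start: load once, then stay warm
        _WEIGHTS = _load_weights(event["expert_id"])
    x = np.asarray(event["activation"], dtype=float)
    h = np.maximum(_WEIGHTS["w_in"] @ x, 0.0)  # plain ReLU FFN as a stand-in
    return {"output": (_WEIGHTS["w_out"] @ h).tolist()}


print(handler({"expert_id": 7, "activation": [0.0] * 2048})["output"][:3])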

If this is right

  • Only the experts that activate for a given input consume memory and compute, so idle capacity can be reclaimed immediately.
  • Multiple tenants can share the same pool of functions without each tenant provisioning a full copy of the model.
  • Operators can choose expert grouping size inside each function to balance invocation cost against the ability to scale individual experts independently (a rough illustration follows this list).
  • The system supports scale-to-zero for inactive experts, which is especially useful when request patterns are bursty or sparse across tenants.
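A rough illustration of the grouping trade-off, with assumed counts (60 experts, top-4 routing) rather than figures from the paper: larger groups mean fewer deployed functions and, when co-activated experts share a group, fewer invocations per token, but the unit that scales and occupies memory grows from one expert to a whole group.

# Back-of-envelope view of the granularity knob (numbers assumed, not measured).
import math


def grouping_tradeoff(num_experts: int, top_k: int, group_size: int) -> dict:
    functions_deployed = math.ceil(num_experts / group_size)
    # Worst case: every activated expert falls into a different group.
    max_invocations_per_token = min(top_k, functions_deployed)
    # Best case: activated experts are packed into as few groups as possible.
    min_invocations_per_token = math.ceil(top_k / min(group_size, top_k * group_size))
    return {"functions_deployed": functions_deployed,
            "invocations_per_token": (min_invocations_per_token,
                                      max_invocations_per_token),
            "experts_resident_per_warm_function": group_size}


for g in (1, 4, 15):   # per-expert functions vs. two coarser groupings
    print(f"group_size={g}: {grouping_tradeoff(num_experts=60, top_k=4, group_size=g)}")

Fewer invocations only materialize when co-activated experts land in the same group, while per-expert elasticity is what shrinks as groups grow; that is the trade-off the configurable granularity exposes.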

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same decomposition into stateless functions could be applied to other sparsely activated neural architectures beyond standard MoE.
  • In memory-constrained edge or IoT deployments the approach might allow models that exceed device RAM when fully loaded.
  • Routing logic that decides which experts to call could itself be moved into the FaaS control plane to avoid maintaining a central model copy.

Load-bearing premise

The added latency and overhead of launching FaaS functions for each expert or group stay small enough that the resource savings are not erased under realistic patterns of multi-tenant requests.
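One way to see what this premise demands is a toy break-even model in resource-time units. Every number below is an assumption chosen only to show the shape of the trade-off, not a measurement from the paper.

# Toy break-even model for the load-bearing premise (all parameters are assumptions).
def gb_seconds_full_model(total_experts_gb: float, window_s: float) -> float:
    """Baseline: every expert stays resident for the whole measurement window."""
    return total_experts_gb * window_s


def gb_seconds_faas(active_gb_per_req: float, exec_s: float, cold_start_s: float,
                    cold_start_fraction: float, requests: int) -> float:
    """FaaS: only activated experts are resident, but cold starts add resident time."""
    per_request = active_gb_per_req * (exec_s + cold_start_fraction * cold_start_s)
    return per_request * requests


window = 60.0   # one-minute window, illustrative
baseline = gb_seconds_full_model(total_experts_gb=12.0, window_s=window)
serverless = gb_seconds_faas(active_gb_per_req=1.0, exec_s=0.2,
                             cold_start_s=1.5, cold_start_fraction=0.3,
                             requests=120)
print(f"baseline {baseline:.0f} GB*s, serverless {serverless:.0f} GB*s, "
      f"ratio {serverless / baseline:.2f}")

Under these assumed numbers the overhead leaves the savings intact; a longer cold start or a higher cold-start fraction pushes the ratio back toward one, which is exactly what the evaluation needs to rule out.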

What would settle it

An experiment that measures total memory and CPU hours across many concurrent tenants while logging per-request latency, checking whether end-to-end response time rises enough to offset the measured reduction to below one third of baseline resource use.
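If such logs existed, reducing them to the decisive quantities is straightforward; the sketch below assumes a log schema the paper does not specify.

# Sketch of reducing per-request logs to the quantities that would settle the question
# (the field names and sample values are assumptions, not the paper's data).
import statistics


def summarize(requests: list, mem_samples: list) -> dict:
    """requests: [{'latency_s': ...}, ...]; mem_samples: [(interval_s, resident_gb), ...]."""
    latencies = sorted(r["latency_s"] for r in requests)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]   # nearest-rank style estimate
    gb_hours = sum(dt * gb for dt, gb in mem_samples) / 3600.0
    return {"p50_latency_s": statistics.median(latencies),
            "p95_latency_s": p95,
            "memory_gb_hours": gb_hours}


faas_run = summarize(
    requests=[{"latency_s": 0.4}, {"latency_s": 0.7}, {"latency_s": 1.9}],
    mem_samples=[(60.0, 1.2), (60.0, 0.8)])
print(faas_run)   # compare against the same summary for the full-model baseline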

Figures

Figures reproduced from arXiv: 2604.26881 by David Bermbach, Minghe Wang, Mohammadreza Malekabbasi, Trever Schirmer.

Figure 1: Architecture Overview: FaaSMoE decouples MoE …
Figure 2: Deployment Strategies: Baseline is the default de…
Figure 3: Average total CPU and Memory Usage among dif…
Figure 5: Average CPU and Memory Usage among different …
Original abstract

Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resource used by activated experts and the provisioned resources. This underutilization is further pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in a multi-tenant environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FaaSMoE, a serverless multi-tenant MoE serving architecture that decouples control and execution by deploying experts as stateless FaaS functions with configurable granularity. A prototype is implemented on an open-source edge-oriented FaaS platform and evaluated with Qwen1.5-moe-2.7B under multi-tenant workloads, claiming resource consumption less than one third of a full-model baseline.

Significance. If the resource savings are robust, the work demonstrates a practical serverless approach to mitigating memory underutilization in MoE models, especially valuable in multi-tenant settings. The configurable granularity mechanism and explicit separation of planes are useful design contributions that could inform scalable inference systems.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim that FaaSMoE uses less than one third of the resources versus the full-model baseline is presented without details on the resource metric (memory, CPU, or aggregate), baseline implementation, workload generation and characteristics, number of trials, or statistical reporting. This absence prevents verification of the result and directly undermines the soundness of the empirical contribution.
  2. [§3] Architecture and overhead discussion (around §3): the paper does not quantify or bound the invocation overheads (cold starts, activation transfer, routing without full model residency) under the multi-tenant patterns used in the evaluation. Because the <1/3 savings claim requires these costs to remain below the memory reduction threshold, the lack of such analysis is load-bearing for the main result.
minor comments (1)
  1. [Abstract and Implementation] The abstract and implementation section should explicitly name the open-source FaaS platform used for the prototype to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional detail will improve the verifiability and soundness of our empirical claims. We address each major comment below and will incorporate the requested information in the revised manuscript.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim that FaaSMoE uses less than one third of the resources versus the full-model baseline is presented without details on the resource metric (memory, CPU, or aggregate), baseline implementation, workload generation and characteristics, number of trials, or statistical reporting. This absence prevents verification of the result and directly undermines the soundness of the empirical contribution.

    Authors: We agree that the evaluation section requires substantially more detail to support the central <1/3 resource claim. The current manuscript reports aggregate results from the prototype but omits the requested specifics. In the revision we will expand the evaluation section to explicitly define the primary resource metric (peak resident memory with secondary CPU utilization), describe the full-model baseline (all experts kept resident with no FaaS invocation), detail workload generation (multi-tenant request traces with varying tenant counts, Poisson arrivals, and Qwen-compatible input sequences), state the number of trials (10 independent runs per configuration), and include statistical reporting (means, standard deviations, and 95% confidence intervals). These additions will enable readers to reproduce and assess the robustness of the reported savings. revision: yes

  2. Referee: [§3] Architecture and overhead discussion (around §3): the paper does not quantify or bound the invocation overheads (cold starts, activation transfer, routing without full model residency) under the multi-tenant patterns used in the evaluation. Because the <1/3 savings claim requires these costs to remain below the memory reduction threshold, the lack of such analysis is load-bearing for the main result.

    Authors: We acknowledge that the absence of quantitative overhead analysis is a limitation that affects the strength of the main result. While §3 describes the design choices (stateless FaaS experts, configurable granularity, and plane separation) intended to keep invocation costs low, it does not provide measured bounds under the multi-tenant workloads. In the revision we will add a dedicated overhead subsection with empirical measurements of cold-start latency, activation transfer time, and routing cost for the evaluated workloads. We will also present a net-resource analysis showing that these overheads remain below the memory savings threshold, thereby supporting the <1/3 claim. If the existing prototype logs are insufficient, we will conduct additional targeted experiments. revision: yes
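A minimal sketch of the methodology these responses commit to (Poisson arrivals per tenant, repeated runs, normal-approximation 95% confidence intervals), with placeholder parameters and values rather than the authors' configuration or results.

# Sketch of the promised evaluation methodology; all parameters below are placeholders.
import math
import random
import statistics


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int) -> list:
    """Arrival timestamps for one tenant via exponential inter-arrival times."""
    rng, t, times = random.Random(seed), 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t > duration_s:
            return times
        times.append(t)


def ci95(samples: list) -> tuple:
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(samples)
    half = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half


# Ten independent runs of, say, measured memory GB-hours per run (placeholder values).
runs = [3.1, 2.9, 3.4, 3.0, 3.2, 2.8, 3.3, 3.1, 3.0, 3.2]
mean, half = ci95(runs)
print(f"memory: {mean:.2f} +/- {half:.2f} GB*h (95% CI over {len(runs)} runs)")
trace = poisson_arrivals(rate_per_s=2.0, duration_s=30.0, seed=0)
print(f"tenant 0: {len(trace)} requests over 30 s")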

Circularity Check

0 steps flagged

No circularity; empirical system evaluation independent of self-referential inputs

Full rationale

The paper is a systems proposal describing FaaSMoE architecture, prototype implementation on an edge FaaS platform, and direct empirical comparison of resource usage against a full-model baseline using Qwen1.5-moe-2.7B under multi-tenant workloads. No equations, parameter fits, predictions, or derivation steps appear. Claims rest on measured outcomes rather than any self-definition, fitted-input renaming, or self-citation chain that reduces the central result to its own inputs by construction. This is the expected non-finding for an implementation-and-evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that FaaS invocation overhead is low enough to preserve MoE efficiency gains and that multi-tenant isolation does not introduce new bottlenecks.

axioms (1)
  • domain assumption FaaS platforms can invoke stateless expert functions with sufficiently low latency and overhead to maintain MoE inference benefits
    Invoked when claiming on-demand scale-to-zero without performance loss.
invented entities (1)
  • FaaSMoE architecture no independent evidence
    purpose: Decouple control and execution planes of MoE via FaaS
    The proposed system itself; no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5529 in / 1296 out tokens · 60347 ms · 2026-05-07T10:28:07.253693+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1] Amazon Web Services. [n. d.]. AWS Lambda Pricing. https://aws.amazon.com/lambda/pricing/

  2. [2] Ioana Baldini, Paul Castro, Kerry Chang, Perry Cheng, Stephen Fink, Vatche Ishakian, Nick Mitchell, Vinod Muthusamy, Rodric Rabbah, Aleksander Slominski, and Philippe Suter. 2017. Serverless Computing: Current Trends and Open Problems. In Research Advances in Cloud Computing. Springer

  3. [3] David Bermbach, Abhishek Chandra, Chandra Krintz, Aniruddha Gokhale, Aleksander Slominski, Lauritz Thamsen, Everton Cavalcante, Tian Guo, Ivona Brandic, and Rich Wolski. 2021. On the Future of Cloud Engineering. In Proceedings of the 9th IEEE International Conference on Cloud Engineering (IC2E '21). 264–275

  4. [4] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. 2024. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 (2024)

  5. [5] Databricks on AWS. 2025. Serverless GPU Compute on Databricks. https://docs.databricks.com/aws/en/compute/serverless/gpu

  6. [6] Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing. 434–442

  7. [7–8] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547–5569

  9. [9] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  10. [10] En-Ming Huang, Li-Shang Lin, and Chun-Yi Lee. 2025. Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems. arXiv preprint arXiv:2512.16473 (2025)

  11. [11] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of Experts. arXiv preprint arXiv:2401.04088 (2024)

  12. [12] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  13. [13] Mengfan Liu, Wei Wang, and Chuan Wu. 2025. Optimizing distributed deployment of mixture-of-experts model inference in serverless computing. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications. IEEE, 1–10

  14. [14] Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, Peng Sun, Zhenhua Han, Tian Tang, Xiaohe Hu, Yanmin Jia, Yan Zhang, et al. 2025. Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving. arXiv preprint arXiv:2509.17863 (2025)

  15. [15] Tobias Pfandzelter and David Bermbach. 2020. tinyFaaS: A Lightweight FaaS Platform for Edge Environments. In Proceedings of the Second IEEE International Conference on Fog Computing (ICFC). IEEE

  16. [16] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning. PMLR, 18332–18346

  17. [17] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14

  18. [18] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506

  19. [19] Trever Schirmer and David Bermbach. 2025. Towards a Testbed for Scalable FaaS Platforms. In Proceedings of the 13th IEEE International Conference on Cloud Engineering. IEEE

  20. [20] Trever Schirmer, Nils Japke, Sofia Greten, Tobias Pfandzelter, and David Bermbach. 2023. The Night Shift: Understanding Performance Variability of Cloud Serverless Platforms. In Proceedings of the 1st Workshop on SErverless Systems, Applications and MEthodologies. ACM

  21. [21] Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. 2020. Serverless in the Wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 205–218

  22. [22] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538 (2017)

  23. [23] Qwen Team et al. 2024. Qwen2 Technical Report. arXiv preprint arXiv:2407.10671 2, 3 (2024)

  24. [24] Minghe Wang, Trever Schirmer, Tobias Pfandzelter, and David Bermbach. 2023. Lotus: Serverless In-Transit Data Processing for Edge-based Pub/Sub. In Proceedings of the 4th International Workshop on Edge Systems, Analytics and Networking (Rome, Italy) (EdgeSys '23). Association for Computing Machinery (ACM), New York, NY, USA. doi:10.1145/3578354.3592869

  25. [25] Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, and Ping Zhang. 2025. WDMoE: Wireless Distributed Mixture of Experts for Large Language Models. IEEE Transactions on Wireless Communications (2025)

  26. [26] Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. 2025. fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving. arXiv preprint arXiv:2502.05370 (2025)

  27. [27] Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, and Lidan Shou. 2025. FloE: On-the-Fly MoE Inference on Memory-constrained GPU. arXiv preprint arXiv:2505.05950 (2025)