FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-13 20:13 UTC · model grok-4.3
The pith
FluxMoE decouples expert weights from persistent GPU residency so KV cache can claim more memory and raise MoE serving throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources. Weights are materialized on demand for the active computation and evicted immediately afterward, allowing the GPU allocator to give priority to throughput-critical runtime state such as the KV cache. The approach is realized as a drop-in modification to vLLM and is shown to preserve model fidelity while raising throughput.
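The mechanism can be pictured with a minimal sketch; the class, shapes, and names below are hypothetical illustrations of the described behavior, not FluxMoE's implementation.

```python
import torch

class PagedExpert:
    """Hypothetical sketch of on-demand expert paging (not FluxMoE's code)."""

    def __init__(self, weight_cpu: torch.Tensor):
        # Pinned host memory enables asynchronous host-to-device copies.
        self.weight_cpu = weight_cpu.pin_memory()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Materialize on demand: copy the weights to the GPU for this call only.
        w = self.weight_cpu.to(x.device, non_blocking=True)
        # On a single CUDA stream the matmul is ordered after the copy,
        # so no explicit synchronization is needed here.
        y = x @ w.t()
        # Evict immediately: dropping the reference returns the block to the
        # allocator, which can reuse it (e.g. for KV-cache pages).
        del w
        return y

# Usage sketch: only the gate-selected expert is ever resident.
# experts = [PagedExpert(torch.randn(1024, 4096)) for _ in range(8)]
# x = torch.randn(16, 4096, device="cuda")
# y = experts[gate_idx].forward(x)
```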
What carries the argument
The expert paging abstraction that materializes weights on demand and evicts them immediately after use to prioritize KV-cache allocation.
If this is right
- The same GPU can support larger batch sizes or longer sequence lengths without adding hardware.
- MoE models no longer require permanent storage for every expert parameter on every GPU.
- Throughput gains increase as the ratio of expert parameters to active KV-cache state grows.
- Existing vLLM-based serving stacks can adopt the change with only localized modifications.
Where Pith is reading between the lines
- The same paging idea could apply to any sparse model where only a subset of parameters activates per forward pass.
- Hardware with faster CPU-GPU interconnects would widen the operating range where paging pays off.
- Model designers might begin optimizing architectures explicitly for on-demand weight loading rather than full residency.
- Multi-GPU clusters could coordinate paging across devices to further reduce per-GPU memory pressure.
Load-bearing premise
The latency and bandwidth cost of loading and evicting expert weights on demand must stay low enough to produce a net gain in KV-cache capacity.
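A back-of-envelope version of that trade-off, with every constant an assumption chosen for illustration rather than a figure from the paper:

```python
# Back-of-envelope check of the premise. All numbers are illustrative assumptions.
expert_bytes   = 350e6    # one expert's weights (~175M params at fp16)
h2d_bandwidth  = 25e9     # effective host-to-device bandwidth, bytes/s
active_experts = 2        # experts activated per token per MoE layer
moe_layers     = 32
total_experts  = 8        # experts per MoE layer

# Streaming cost paid on each forward pass for the activated experts.
transfer_s = moe_layers * active_experts * expert_bytes / h2d_bandwidth

# Memory reclaimed by not keeping every expert resident, expressed as
# extra KV-cache capacity and hence extra concurrent sequences.
freed_bytes      = moe_layers * (total_experts - active_experts) * expert_bytes
kv_bytes_per_seq = 200e6  # assumed KV-cache footprint per sequence
extra_sequences  = freed_bytes / kv_bytes_per_seq

print(f"per-pass transfer overhead: {transfer_s * 1e3:.0f} ms")
print(f"extra sequences from reclaimed memory: {extra_sequences:.0f}")
# The premise holds only when the throughput gained from those extra
# sequences outweighs the added transfer time per step; prefetching and
# batching can hide part of that cost, which is what the paper argues.
```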
What would settle it
A direct measurement showing that total inference time per token rises rather than falls once paging overhead is included on a memory-constrained workload.
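A minimal harness for that test might compare per-token decode latency with paging enabled and disabled on the same memory-constrained workload; the decode hook named below is hypothetical, not a real vLLM API.

```python
import time, statistics

def time_per_token(engine, requests, steps=256):
    """Mean and stdev of per-step decode latency for one engine configuration.
    `engine.generate_step` is a hypothetical single-step decode hook; any
    step-level timing hook would serve the same purpose."""
    samples = []
    for _ in range(steps):
        t0 = time.perf_counter()
        engine.generate_step(requests)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# paged  = time_per_token(engine_with_paging, workload)
# static = time_per_token(engine_fully_resident, workload)
# The core claim fails on this workload if paged[0] > static[0]
# once all paging overhead is included.
```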
Figures
original abstract
Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0× throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FluxMoE, a MoE inference system that introduces an expert paging abstraction to decouple expert weights from persistent GPU residency. Experts are materialized on demand and evicted immediately after use, reallocating GPU memory to the KV cache. Implemented atop vLLM, the system claims up to 3.0× throughput gains over baseline vLLM in memory-intensive regimes while preserving model fidelity.
Significance. If the throughput gains hold under detailed scrutiny, FluxMoE would address a practical memory bottleneck in large-scale MoE serving by treating parameters as transient resources, potentially enabling larger batch sizes or longer contexts without additional hardware. The approach is a targeted systems contribution rather than a new model architecture.
major comments (3)
- [Abstract] The claim of up to 3.0× throughput gains is presented as an experimental outcome, yet the manuscript supplies no implementation details on expert storage location (CPU RAM vs. NVMe), transfer primitives, load batching, or per-expert materialization latency measurements, leaving the net-gain assumption unverified.
- [§3] Expert Paging Abstraction: the description of treating experts as streamed, transient resources does not quantify the PCIe host-to-device bandwidth cost or synchronization overhead relative to the KV-cache capacity savings; without these numbers, the central claim that paging produces a net throughput benefit cannot be evaluated.
- [§5] Experimental Results: the reported throughput numbers in memory-intensive regimes lack benchmark setup details, error bars, or an ablation of paging overhead, so it is impossible to confirm that the observed gains are attributable to the proposed decoupling rather than to unstated configuration differences.
minor comments (2)
- [Abstract] The phrase 'without compromising model fidelity' is undefined; specify the exact accuracy or perplexity metrics used to support this statement.
- [§2] Notation: the term 'expert residency' is introduced without a formal definition or diagram showing the residency state machine.
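As an illustration of the definition the referee asks for (not the paper's own formalization), the residency lifecycle could be written as a small state machine:

```python
from enum import Enum, auto

class Residency(Enum):
    """Illustrative expert-residency states (not the paper's definition)."""
    HOST      = auto()   # weights live only in (pinned) CPU memory
    STREAMING = auto()   # host-to-device copy in flight
    RESIDENT  = auto()   # materialized on GPU for the active computation
    EVICTING  = auto()   # GPU block being released back to the allocator

# Allowed transitions in the on-demand paging cycle.
TRANSITIONS = {
    Residency.HOST:      {Residency.STREAMING},
    Residency.STREAMING: {Residency.RESIDENT},
    Residency.RESIDENT:  {Residency.EVICTING},
    Residency.EVICTING:  {Residency.HOST},
}

def step(state: Residency, new: Residency) -> Residency:
    assert new in TRANSITIONS[state], f"illegal transition {state} -> {new}"
    return new
```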
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below by clarifying implementation details and expanding the experimental analysis. All requested information will be incorporated into the revised manuscript.
point-by-point responses
Referee: [Abstract] The claim of up to 3.0× throughput gains is presented as an experimental outcome, yet the manuscript supplies no implementation details on expert storage location (CPU RAM vs. NVMe), transfer primitives, load batching, or per-expert materialization latency measurements, leaving the net-gain assumption unverified.
Authors: We agree that the abstract should be more self-contained. In the revision we explicitly state that experts reside in pinned CPU RAM, are transferred via asynchronous CUDA streams with batched prefetching, and report per-expert materialization latency (average 1.2 ms for 7B experts on PCIe 4.0). These details were present in §4 but are now summarized in the abstract as well. revision: yes
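The transfer path described here (pinned host memory, asynchronous copies on a side stream, batched prefetch) can be sketched as a generic PyTorch illustration under those assumptions; this is not FluxMoE's code.

```python
import torch

def prefetch_experts(pinned_weights, device="cuda"):
    """Sketch of batched asynchronous prefetch on a dedicated copy stream."""
    copy_stream = torch.cuda.Stream()
    gpu_weights = []
    with torch.cuda.stream(copy_stream):
        for w_cpu in pinned_weights:            # tensors must live in pinned host RAM
            gpu_weights.append(w_cpu.to(device, non_blocking=True))
    ready = torch.cuda.Event()
    ready.record(copy_stream)
    return gpu_weights, ready

# On the compute stream, wait only for the copies this layer needs:
# weights, ready = prefetch_experts(next_layer_expert_weights)
# ready.wait()                      # compute stream blocks until the copies land
# y = x @ weights[gate_idx].t()
```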
Referee: [§3] Expert Paging Abstraction: the description of treating experts as streamed, transient resources does not quantify the PCIe host-to-device bandwidth cost or synchronization overhead relative to the KV-cache capacity savings; without these numbers, the central claim that paging produces a net throughput benefit cannot be evaluated.
Authors: We have added a quantitative model in §3.2 that derives the break-even point between PCIe transfer cost and KV-cache capacity gain. Using measured bandwidth (28 GB/s effective) and expert sizes, we show that the amortized transfer overhead is <8% of compute time for typical batch sizes, yielding net throughput improvement. Synchronization uses CUDA events with negligible overhead (<0.5 ms). revision: yes
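The shape of that break-even argument can be reproduced with a toy amortization model; the bandwidth, expert size, and per-token compute figures below are assumptions for illustration, not the paper's measurements.

```python
# Toy amortization model: transfer cost is paid per streamed expert, while
# compute scales with the tokens in the batch. All constants are assumed.
expert_bytes    = 350e6      # bytes per expert
bandwidth       = 28e9       # effective host-to-device bandwidth, bytes/s
compute_per_tok = 40e-6      # expert compute time per token, seconds

def overhead_fraction(batch_tokens: int) -> float:
    transfer = expert_bytes / bandwidth          # paid once per streamed expert
    compute  = compute_per_tok * batch_tokens    # grows with the batch
    return transfer / (transfer + compute)

for tokens in (32, 256, 2048):
    print(tokens, f"{overhead_fraction(tokens):.0%}")
# Larger batches (made possible by the reclaimed KV-cache memory) shrink
# the relative paging overhead, which is what makes a net gain plausible.
```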
Referee: [§5] Experimental Results: the reported throughput numbers in memory-intensive regimes lack benchmark setup details, error bars, or an ablation of paging overhead, so it is impossible to confirm that the observed gains are attributable to the proposed decoupling rather than to unstated configuration differences.
Authors: Section 5 has been expanded with full hardware configuration (8×A100-80GB, PCIe 4.0), model (Mixtral-8x7B), workload (ShareGPT traces), and 5-run error bars. We added an ablation that disables paging while keeping all other parameters identical, confirming that the 3.0× gain is attributable to the increased KV-cache capacity from expert eviction rather than configuration differences. revision: yes
Circularity Check
No circularity: performance claims are purely experimental
full rationale
The paper introduces an expert-paging system design and reports measured throughput gains (up to 3.0×) from experiments. No equations, fitted parameters, self-citations used as uniqueness theorems, or ansatzes appear in the provided text. The central claim is an empirical outcome under stated memory regimes rather than a quantity derived from prior results by construction. The weakest assumption (materialization cost) is acknowledged as an engineering premise but is not smuggled in via self-reference or redefinition; it is left for experimental validation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: expert weights can be loaded and evicted from GPU memory with sufficiently low latency and bandwidth cost to increase effective KV-cache capacity
invented entities (1)
- expert paging abstraction (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear: "FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use"