pith. sign in

arxiv: 2606.21712 · v1 · pith:G6R76UWKnew · submitted 2026-06-19 · 💻 cs.DC · cs.LG

BatchGen: An Architecture for Scalable and Efficient Batch Inference

Pith reviewed 2026-06-26 12:56 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords batch inferencesequence coroutinesGPU cluster utilizationdynamic work allocationstraggler mitigationmemory-constrained accelerators
0
0 comments X

The pith

Representing sequences as event-driven coroutines lets batch inference systems reorganize work at runtime for higher GPU utilization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing inference engines, built for interactive serving, cannot handle the extreme load variation that appears only when batch workloads scale to millions of sequences. It proposes the sequence coroutine compute model, in which each sequence becomes a fine-grained event-driven coroutine, as a new architectural foundation. This model supplies runtime primitives that support dynamic batching, straggler mitigation, cross-device reallocation, and continued operation on memory-limited hardware. If the model works as described, systems built on it can sustain higher utilization across large clusters and cheaper accelerators, directly cutting batch completion time.

Core claim

The sequence coroutine compute model represents each sequence as a fine-grained, event-driven coroutine. This abstraction exposes primitives that allow the runtime to reorganize work dynamically, enabling larger expert-level batches, mitigating stragglers, reallocating work across devices, and maintaining utilization even on cost-effective or memory-constrained GPUs.

What carries the argument

The sequence coroutine compute model, which turns each sequence into an event-driven coroutine so the runtime can reorganize computation on the fly.

If this is right

  • Larger expert-level batches become feasible without straggler bottlenecks.
  • Work can be reallocated across devices at runtime to keep all GPUs busy.
  • Utilization remains high on memory-constrained or lower-cost accelerators.
  • Batch completion time drops by up to 2.3 imes on 128-GPU clusters.
  • Outperformance versus offloading baselines reaches up to 9.6 imes on memory-limited hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coroutine primitives could be applied to other variable-workload distributed tasks such as large-scale simulation or data processing.
  • Production inference engines might adopt the model incrementally by wrapping existing kernels rather than rewriting entire schedulers.
  • The approach could lower the hardware requirements for high-throughput batch jobs by extracting more performance from commodity GPUs.

Load-bearing premise

The observed speedups stem primarily from the coroutine primitives rather than from unstated implementation choices, workload selection, or hardware tuning.

What would settle it

A controlled run on the same 128-GPU cluster and workloads that turns off the dynamic reorganization primitives while retaining all other optimizations and measures whether the 2.3 imes and 9.6 imes gains disappear.

Figures

Figures reproduced from arXiv: 2606.21712 by Congjie He, Hongyang Xiao, Jinfu Deng, Le Xu, Leyang Xue, Luo Mai, Matej Sandor, Tairan Xu, Yinsicheng Jiang, Zhan Lu.

Figure 1
Figure 1. Figure 1: New system design goals for batch inference. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Challenges in MoE batch inference. of-Experts (MoE), as the primary mechanism for scaling model capacity. Shown in Figure 2a, modern models such as DeepSeek-R1 [11], Kimi-K2 [22], GPT-5 [38], Gemini 3 Pro [41], and Grok [52] all deploy MoE models with hun￾dreds of experts. The MoE layers contain the majority of the model parameters and contribute a large portion of the compute required for each token. Sinc… view at source ↗
Figure 3
Figure 3. Figure 3: Event-driven Sequence Coroutine Architecture. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Coroutine primitives used on sequence. to resume execution correctly at the next scheduling point. Se￾quence computation exposes primitives (i.e. yield, combine, partition, migrate) to enable coroutine scheduling on se￾quence computation, and callbacks for customized operations on sequence states. Creating coroutines through module wrappers. We pro￾vide an abstraction layer that preserves the modular seman… view at source ↗
Figure 4
Figure 4. Figure 4: Sequence coroutine abstraction. 4 The Sequence Coroutine Model 4.1 The Sequence Coroutine Abstraction We abstract sequence coroutines as follows: a representa￾tion of a neural network model’s per-sequence execution that can be paused, migrated, combined, partitioned, and resumed without losing correctness. Establishing this abstraction requires answering three key questions: (i) How to model the sequence’s… view at source ↗
Figure 6
Figure 6. Figure 6: Yield point selection. to achieve better utilization for batch progression (e.g. switch from decode to prefill to refill sequence) or under resource pressure (e.g. evict sequence under growing decode length that exceeds GPU memory). The yielded sequence can be combined again later with other active sequences running on GPU. Because these conditions depend on runtime state, inter-forward yield points are de… view at source ↗
Figure 7
Figure 7. Figure 7: Memory layout and execution flow for prefill (top) [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: BatchGen system architecture with master-worker [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
read the original abstract

Batch inference has become a central mode of AI computation, yet existing inference engines still rely on execution models designed for interactive serving. When scaled to millions of sequences, batch workloads reveal two fundamental requirements: the ability to handle extreme inter- and intra-sequence load variation that emerges only at runtime, and the ability to sustain high utilization across large fleets of GPUs. Existing systems fail to meet these requirements, losing substantial fractions of achievable throughput. We introduce a new architectural foundation for batch inference: the sequence coroutine compute model, which represents each sequence as a fine-grained, event-driven coroutine. This model exposes expressive primitives that allow the runtime to reorganize work dynamically, enabling larger expert-level batches, mitigating stragglers, reallocating work across devices, and maintaining utilization even on cost-effective or memory-constrained GPUs. Building on this abstraction, we implement BatchGen, a production-ready system that uses the coroutine model at cluster scale. On a 128-GPU cluster, BatchGen reduces batch completion time by up to $2.3\times$, and on memory-constrained accelerators it outperforms the strongest offloading baseline by up to $9.6\times$. We will open-source BatchGen at https://github.com/batchgen-project/batchgen

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents BatchGen, a system for scalable and efficient batch inference based on the sequence coroutine compute model. This model treats each sequence as a fine-grained, event-driven coroutine, exposing primitives for dynamic work reorganization to handle inter- and intra-sequence load variation, mitigate stragglers, reallocate work across devices, and maintain high utilization on various GPUs including memory-constrained ones. On a 128-GPU cluster, it claims up to 2.3× reduction in batch completion time, and up to 9.6× better performance than the strongest offloading baseline on memory-constrained accelerators. The system is to be open-sourced.

Significance. Should the reported performance improvements be validated through detailed experiments, this work would offer a significant advancement in the design of batch inference systems for large-scale AI workloads. By shifting from interactive-serving-oriented models to one optimized for batch processing with high variability, it addresses key inefficiencies in current systems. The open-source release would be particularly valuable for the community.

major comments (1)
  1. [Abstract] Abstract: The abstract reports concrete speedups (2.3× on 128-GPU cluster and 9.6× vs offloading baseline) but provides no details on experimental setup, baselines, error bars, workload characteristics, or measurement methodology. As noted in the review materials, the central performance claims cannot be evaluated from the given information, undermining the ability to assess the contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the sequence coroutine model for batch inference. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract reports concrete speedups (2.3× on 128-GPU cluster and 9.6× vs offloading baseline) but provides no details on experimental setup, baselines, error bars, workload characteristics, or measurement methodology. As noted in the review materials, the central performance claims cannot be evaluated from the given information, undermining the ability to assess the contribution.

    Authors: We agree that the abstract, in its current form, is concise and omits explicit details on experimental setup, baselines, workloads, error bars, and methodology. This is a valid observation, as abstracts are intentionally brief. The full manuscript (Sections 5 and 6) provides the requested information, including the 128-GPU cluster configuration, workload characteristics (sequence length distributions and batch sizes drawn from production traces), comparison baselines (including the strongest offloading system), measurement methodology (end-to-end batch completion time with wall-clock timing), and reporting of results with variability. To directly address the concern, we will revise the abstract in the next version to include a short clause summarizing the evaluation setting (e.g., 'evaluated on a 128-GPU cluster using diverse LLM inference workloads against state-of-the-art baselines'). We believe this change will allow readers to better contextualize the reported speedups without exceeding typical abstract length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces the sequence coroutine compute model as an architectural foundation and reports empirical speedups (2.3× on 128-GPU cluster, 9.6× vs offloading baseline) from system implementation and measurements. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. All central claims are performance statements grounded in experiments rather than any reduction to inputs by construction, self-definition, or load-bearing self-citation chains. The derivation chain is therefore self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unelaborated assumption that the coroutine model can be realized efficiently at scale.

pith-pipeline@v0.9.1-grok · 5781 in / 1230 out tokens · 14309 ms · 2026-06-26T12:56:30.777855+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 1 linked inside Pith

  1. [1]

    Process multiple prompts with batch in- ference

    Amazon AWS. Process multiple prompts with batch in- ference. https://docs.aws.amazon.com/bedrock/ latest/userguide/batch-inference.html, 2025. Accessed: 2025-12-02

  2. [2]

    L-eval: Instituting standardized evaluation for long context language models

    Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. InProceedings of the 62nd An- nual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 14388–14411, 2024

  3. [3]

    Apache HTTP server project

    Apache HTTP Server Project Members. Apache HTTP server project. https://httpd.apache.org/, 2025. Accessed: 2025-12-06

  4. [4]

    LongBench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InACL (1), pages 3119–3137. Association for Computational Linguistics, 2024

  5. [5]

    Beyond the imitation game: Quan- tifying and extrapolating the capabilities of language models.Trans

    BIG-bench authors. Beyond the imitation game: Quan- tifying and extrapolating the capabilities of language models.Trans. Mach. Learn. Res., 2023, 2023

  6. [6]

    verl: V olcano Engine Reinforce- ment Learning for LLMs

    Bytedance Seed. verl: V olcano Engine Reinforce- ment Learning for LLMs . https://github.com/ volcengine/verl, 2025. Accessed: 2025-12-09

  7. [7]

    Gonzalez, Matei Za- haria, and Ion Stoica

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xi- aoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Za- haria, and Ion Stoica. MoE-Lightning: High-throughput moe inference on memory-constrained gpus. InASP- LOS (1), pages 715–730. ACM, 2025

  8. [8]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory- efficient exact attention with io-awareness. InNeurIPS, 2022

  9. [9]

    Deepseek AI. DeepEP. https://github.com/ deepseek-ai/DeepEP, 2025. Accessed: 2025-12-06

  10. [10]

    DeepGEMM

    Deepseek AI. DeepGEMM. https://github.com/ deepseek-ai/DeepGEMM, 2025. Accessed: 2025-12- 06

  11. [11]

    DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning. 2025

  12. [12]

    DeepSpeed

    DeepSpeed Team. DeepSpeed. https://github.com/ deepspeedai/DeepSpeed, 2025. Accessed: 2025-12- 09

  13. [13]

    Bitdecoding: Unlocking tensor cores for long-context llms with low-bit kv cache, 2025

    Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, and Mao Yang. Bitdecoding: Unlocking tensor cores for long-context llms with low-bit kv cache, 2025

  14. [14]

    Eltabakh, Zan Ahmad Naeem, Moham- mad Shahmeer Ahmad, Mourad Ouzzani, and Nan Tang

    Mohamed Y . Eltabakh, Zan Ahmad Naeem, Moham- mad Shahmeer Ahmad, Mourad Ouzzani, and Nan Tang. RetClean: Retrieval-based tabular data cleaning using llms and data lakes.Proc. VLDB Endow., 17(12):4421– 4424, 2024

  15. [15]

    ServerlessLLM: Low-latency serverless inference for large language models

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models. InOSDI. USENIX Association, 2024

  16. [16]

    RollPacker: Mitigating long-tail rollouts for fast, synchronous RL post-training

    Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. RollPacker: Mitigating long-tail rollouts for fast, synchronous RL post-training. 2025

  17. [17]

    OpenRLHF: A ray-based easy-to-use, scalable and high-performance rlhf frame- work

    Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Wenkai Fang, et al. OpenRLHF: A ray-based easy-to-use, scalable and high-performance rlhf frame- work. InProceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing: System Demonstrations, pages 656–666, 2025

  18. [18]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon An- toniak, T...

  19. [19]

    NEO: saving GPU memory crisis with CPU offloading for online LLM inference, 2024

    Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. NEO: saving GPU memory crisis with CPU offloading for online LLM inference, 2024

  20. [20]

    MoE-CAP: Benchmarking cost, accuracy and performance of sparse mixture-of-experts systems

    Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, et al. MoE-CAP: Benchmarking cost, accuracy and performance of sparse mixture-of-experts systems. Advances in Neural Information Processing Systems, 38, 2026

  21. [21]

    Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models, 2024

    Keisuke Kamahori, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models, 2024

  22. [22]

    Kimi K2: open agentic intelligence, 2025

    Kimi Team. Kimi K2: open agentic intelligence, 2025

  23. [23]

    Efficient memory manage- ment for large language model serving with pagedatten- tion

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory manage- ment for large language model serving with pagedatten- tion. InSOSP, pages 611–626. ACM, 2023

  24. [24]

    Accelerating distributed MoE training and inference with Lina

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed MoE training and inference with Lina. InUSENIX Annual Technical Con- ference, pages 945–959. USENIX Association, 2023

  25. [25]

    Holistic eval- uation of language models.Trans

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, and et al. Holistic eval- uation of language models.Trans. Mach. Learn. Res., 2023, 2023

  26. [26]

    Janus: A unified distributed training framework for sparse Mixture-of-Experts models

    Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: A unified distributed training framework for sparse Mixture-of-Experts models. InSIGCOMM, pages 486–

  27. [27]

    On LLMs-driven synthetic data generation, curation, and evaluation: A survey

    Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. InACL (Findings), pages 11065–11082. Associ- ation for Computational Linguistics, 2024

  28. [28]

    The streaming batch model for efficient and fault-tolerant heteroge- neous execution, 2024

    Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Char- lotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, and Stephanie Wang. The streaming batch model for efficient and fault-tolerant heteroge- neous execution, 2024

  29. [29]

    Taming hyper- parameters in deep learning systems.ACM SIGOPS Operating Systems Review, 53(1):52–58, 2019

    Luo Mai, Alexandros Koliousis, Guo Li, Andrei- Octavian Brabete, and Peter Pietzuch. Taming hyper- parameters in deep learning systems.ACM SIGOPS Operating Systems Review, 53(1):52–58, 2019

  30. [30]

    KungFu: Making training in distributed machine learn- ing adaptive

    Luo Mai, Guo Li, Marcel Wagenländer, Konstantinos Fertakis, Andrei-Octavian Brabete, and Peter Pietzuch. KungFu: Making training in distributed machine learn- ing adaptive. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 937–954. USENIX Association, November 2020

  31. [31]

    To- wards efficient generative large language model serving: A survey from algorithms to systems.arXiv preprint arXiv:2312.15234, 2023

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. To- wards efficient generative large language model serving: A survey from algorithms to systems.arXiv preprint arXiv:2312.15234, 2023

  32. [32]

    Batch Endpoints

    Microsoft Azure. Batch Endpoints. https://learn. microsoft.com/en-us/azure/machine-learning/ concept-endpoints-batch?view=azureml-api-2 ,

  33. [33]

    Accessed: 2025-12-02

  34. [34]

    Jordan, and Ion Stoica

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Eli- bol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerg- ing AI applications. InOSDI, pages 561–577. USENIX Association, 2018

  35. [35]

    NGINX open source

    NGINX Team. NGINX open source. https://nginx. org/index.html, 2025. Accessed: 2025-12-06

  36. [36]

    TensorRT-LLM

    NVIDIA. TensorRT-LLM. https://github.com/ NVIDIA/TensorRT-LLM, 2024. Accessed: 2024-05-17

  37. [37]

    Ollama. Ollama. https://github.com/ollama/ ollama, 2025

  38. [38]

    Batch API

    OpenAI. Batch API. https://platform.openai. com/docs/guides/batch, 2025. Accessed: 2025-12- 02

  39. [39]

    Gpt-5 system card

    OpenAI. Gpt-5 system card. Technical report, OpenAI, August 2025. Accessed: 2025-12-11

  40. [40]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training lan- guage models to follow instructions with human f...

  41. [41]

    Splitwise: Efficient generative LLM inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. InISCA, pages 118–132. IEEE, 2024

  42. [42]

    A new era of intelligence with gem- ini 3

    Sundar Pichai. A new era of intelligence with gem- ini 3. https://blog.google/products/gemini/ gemini-3/#note-from-ceo , November 2025. Ac- cessed: 2025-12-11

  43. [43]

    Mooncake: Trading more storage for less computation - A kvcache-centric architecture for serving LLM chatbot

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation - A kvcache-centric architecture for serving LLM chatbot. InFAST, pages 155–170. USENIX Association, 2025

  44. [44]

    Laradji, Parmida Atighehchian, David Vázquez, and Dzmitry Bahdanau

    Gaurav Sahu, Pau Rodríguez, Issam H. Laradji, Parmida Atighehchian, David Vázquez, and Dzmitry Bahdanau. Data augmentation for intent classification with off-the- shelf large language models. InConvAI@ACL, pages 47–

  45. [45]

    Association for Computational Linguistics, 2022

  46. [46]

    FlexGen: High- throughput generative inference of large language mod- els with a single GPU

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christo- pher Ré, Ion Stoica, and Ce Zhang. FlexGen: High- throughput generative inference of large language mod- els with a single GPU. InICML, volume 202 ofPro- ceedings of Machine Learning Research, pages 31094– 31116. PMLR, 2023

  47. [47]

    PowerInfer: Fast large language model serving with a consumer-grade GPU, 2023

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. PowerInfer: Fast large language model serving with a consumer-grade GPU, 2023

  48. [48]

    Obando-Ceron, Yoshua Bengio, Brian R

    Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan S. Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume La- joie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Re- cursive self-aggregation unlocks deep thinking in large language models, 2025

  49. [49]

    Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections

    Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. InPro- ceedings of the ACM SIGOPS 30th Symposium on Op- erating Systems Principles, pages 195–210, 2024

  50. [50]

    GEAR: A GPU-centric experience replay system for large reinforcement learning models

    Hanjing Wang, Man-Kit Sit, Congjie He, Ying Wen, Weinan Zhang, Jun Wang, Yaodong Yang, and Luo Mai. GEAR: A GPU-centric experience replay system for large reinforcement learning models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceed- ings of the 40th International Conference on Ma...

  51. [51]

    Le, Ed H

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR. Open- Review.net, 2023

  52. [52]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-Thought prompting elicits reasoning in large language models. InNeurIPS, 2022

  53. [53]

    Culler, and Eric A

    Matt Welsh, David E. Culler, and Eric A. Brewer. SEDA: an architecture for well-conditioned, scalable internet services. InSOSP, pages 230–243. ACM, 2001

  54. [54]

    Grok 4 model card

    xAI. Grok 4 model card. Technical report, xAI, August

  55. [55]

    Accessed: 2025-12-11

  56. [56]

    Qwen2.5-omni technical report, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Jun- yang Lin. Qwen2.5-omni technical report, 2025

  57. [57]

    Moe-infinity: Activation-aware expert offloading for efficient moe serving.arXiv preprint arXiv:2401.14361, 3, 2024

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Ma- hesh Marina. Moe-infinity: Activation-aware expert offloading for efficient moe serving.arXiv preprint arXiv:2401.14361, 3, 2024

  58. [58]

    vllm-omni: Fully disaggregated serving for any-to-any multimodal models, 2026

    Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, Didan Deng, Zifeng Mo, Cong Wang, James Cheng, Roger Wang, and Hong- sheng Liu. vllm-omni: Fully disaggregated serving for any-to-any multimodal models, 2026

  59. [59]

    Orca: A distributed serving system for transformer-based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soo- jeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. InOSDI, pages 521–538. USENIX Association, 2022

  60. [60]

    Resilient dis- tributed datasets: A {Fault-Tolerant} abstraction for {In- Memory} cluster computing

    Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient dis- tributed datasets: A {Fault-Tolerant} abstraction for {In- Memory} cluster computing. In9th USENIX symposium on networked systems design and implementation (NSDI 12), pages 15–28, 2012

  61. [61]

    SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization

    Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Run- qing Zhang, and Jidong Zhai. SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization. InUSENIX Annual Technical Conference, pages 961–975. USENIX Asso- ciation, 2023

  62. [62]

    Data cleaning using large language models

    Shuo Zhang, Zezhou Huang, and Eugene Wu. Data cleaning using large language models. InICDEW, pages 28–32. IEEE, 2025

  63. [63]

    Blendserve: Optimizing offline inference with resource-aware batching

    Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yifan Qiao, Yang Zhou, Jiarong Xing, and Ion Stoica. Blendserve: Optimizing offline inference with resource-aware batching. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’26, pages 255–273, New ...

  64. [64]

    Xing, Joseph E

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. InOSDI, pages 559–578. USENIX Association, 2022

  65. [65]

    Gonzalez, Clark W

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. InNeurIPS, 2024

  66. [66]

    Dist- Serve: Disaggregating prefill and decoding for goodput- optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Dist- Serve: Disaggregating prefill and decoding for goodput- optimized large language model serving. InOSDI, pages 193–210. USENIX Association, 2024

  67. [67]

    A survey on efficient inference for large language models.arXiv preprint arXiv:2404.14294, 2024

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Ji- aming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhi- hang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models.arXiv preprint arXiv:2404.14294, 2024

  68. [68]

    Nanoflow: Towards optimal large language model serving throughput

    Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. Nanoflow: Towards optimal large language model serving throughput. In OSDI, pages 749–765. USENIX Association, 2025

  69. [69]

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. MegaScale- Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism. InSIG...