pith · machine review for the scientific record

arXiv: 2604.07144 · v1 · submitted 2026-04-08 · 💻 cs.DC

Recognition: unknown

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

Binhang Yuan, Fangcheng Fu, Ran Yan, Taiyi Wang, Wenshuang Li, Youhe Jiang, You Peng

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM serving · self-evolving systems · program synthesis · runtime dynamics · policy adaptation · scheduling · autoscaling

The pith

Autopoiesis uses LLMs to continuously synthesize and rewrite serving policies from real-time observations, replacing static human designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving systems operate in volatile settings where workloads fluctuate and clusters scale elastically, so fixed scheduling and rescheduling policies cannot track the shifting trade-off between overhead and efficiency. The paper introduces Autopoiesis, an online self-evolving framework in which an LLM observes actual system behavior and generates updated policy code on the fly. This turns policy design from a one-time offline task into a persistent runtime component that adapts without further human input. The evaluation across varied runtime dynamics reports gains of up to 53 percent, and 34 percent on average, over state-of-the-art static serving systems.

Core claim

Serving policies are no longer static artifacts designed by humans before deployment, but living code that LLMs continuously evolve throughout deployment to navigate runtime trade-offs beyond human design. This is achieved through an LLM-driven program synthesis workflow that observes real-world behavior and rewrites policy code as conditions change.

What carries the argument

LLM-driven program synthesis workflow that observes runtime dynamics and rewrites serving policy code continuously during operation.
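
Taken at face value, this describes a closed loop: observe the serving system, ask an LLM to rewrite the policy, validate the result, and redeploy it. The Python sketch below is a minimal, hypothetical rendering of such a loop; all of the callables (collect_metrics, synthesize_policy, validate, deploy) and the five-minute interval are placeholders, since the abstract exposes no concrete interfaces.

import time

def autopoiesis_loop(collect_metrics, synthesize_policy, validate, deploy,
                     interval_s=300):
    """Hypothetical outer loop for continuous policy evolution.

    Each callable stands in for a component the paper only names at a
    high level: runtime metric collection, LLM-driven policy synthesis,
    pre-deployment validation, and an atomic policy swap.
    """
    current_policy = None
    while True:
        observations = collect_metrics()          # e.g. request rates, queue depths, autoscaling events
        candidate = synthesize_policy(observations, current_policy)  # LLM rewrites the policy code
        if validate(candidate, observations):     # reject candidates that fail checks
            deploy(candidate)                     # atomically swap in the new policy
            current_policy = candidate
        time.sleep(interval_s)                    # or trigger synthesis on detected workload shifts instead

Whether synthesis runs on a fixed period or is triggered by detected shifts in runtime conditions is exactly the kind of design detail the referee report below asks the authors to pin down.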

If this is right

  • Optimal scheduling and rescheduling decisions adapt automatically to workload-specific trade-offs that evolve over time.
  • Policy maintenance shifts from periodic human redesign to continuous autonomous evolution integrated into the serving loop.
  • System performance improves measurably, reaching up to 53 percent gains and 34 percent on average across diverse runtime conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the engineering effort required to tune serving systems for new hardware or traffic patterns.
  • Similar self-synthesis loops might apply to other dynamic resource managers facing intertwined overhead and efficiency choices.
  • Long-term operation could accumulate a growing library of evolved policies that future instances reuse or refine.

Load-bearing premise

An LLM can reliably generate correct, efficient, and safe serving-policy code from real-time observations without introducing bugs, excessive overhead, or decisions that cancel out the performance gains.

What would settle it

A controlled run in which the LLM-synthesized policies produce higher latency, greater instability, or more security violations than the original static baselines under identical workload fluctuations and autoscaling events.

Original abstract

Modern Large Language Model (LLM) serving operates in highly volatile environments characterized by severe runtime dynamics, such as workload fluctuations and elastic cluster autoscaling. Traditional serving systems rely on static, human-engineered serving policies (e.g., scheduling algorithms and rescheduling strategies) to manage these dynamics. However, these policies must navigate deeply intertwined runtime trade-offs (e.g., scheduling overhead vs. execution efficiency, rescheduling frequency vs. reconfiguration overhead), whose optimal balance is workload-specific and shifts continuously as runtime conditions evolve, rendering any fixed policy fundamentally unable to adapt. We propose Autopoiesis, a novel online self-evolving system that shifts LLM serving from static policy deployment to continuous online policy evolution. First, Autopoiesis introduces an LLM-driven program synthesis workflow to evolve serving policies with respect to real-time observed dynamics, where the evolved policies reflect the optimal decision in navigating the complex, multi-dimensional trade-off space. Second, Autopoiesis enables this synthesis process to operate continuously during serving, observing real-world system behavior, and rewriting the policy code as runtime trade-offs shift, thereby transforming policy design from a one-time offline endeavor into an ongoing system component, enabling autonomous adaptation to evolving runtime conditions. Together, we establish a new paradigm: Serving policies are no longer static artifacts designed by humans before deployment, but living code that LLMs continuously evolve throughout deployment to navigate runtime trade-offs beyond human design. We evaluate Autopoiesis across diverse runtime dynamics and show up to 53% and on average 34% improvements over state-of-the-art LLM serving systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Autopoiesis, a novel online self-evolving system for LLM serving that uses LLM-driven program synthesis to continuously evolve serving policies in response to runtime dynamics such as workload fluctuations and elastic autoscaling. It contrasts this with traditional static, human-engineered policies that cannot adapt to shifting trade-offs. The system enables ongoing policy rewriting during serving to achieve better navigation of multi-dimensional trade-offs. Evaluations are claimed to demonstrate up to 53% and on average 34% improvements over state-of-the-art LLM serving systems across diverse runtime dynamics.

Significance. If the self-evolving mechanism proves reliable, this work could mark a paradigm shift in LLM serving from static to adaptive, living policies, offering substantial efficiency gains in volatile production environments. The approach has the potential to reduce the need for manual policy tuning and improve system performance under changing conditions. However, the significance is currently limited by the lack of detailed evidence and mechanisms in the manuscript.

major comments (3)
  1. [Abstract] The performance claims of up to 53% and average 34% improvements are presented without any supporting details on the experimental setup, including the state-of-the-art systems used as baselines, the specific runtime dynamics tested, workload characteristics, or quantitative evaluation data. This absence makes it impossible to determine whether the gains are due to the proposed self-evolution or other unstated factors.
  2. [LLM-driven program synthesis workflow] The description of the online policy evolution lacks any mention of safeguards such as code verification, sandboxing, bounded execution, or rollback mechanisms to handle potentially incorrect or unsafe code synthesized by the LLM. Since the central claim depends on the LLM reliably producing correct and efficient serving policies from real-time observations, this omission is load-bearing and risks the synthesized policies introducing bugs that negate the reported benefits.
  3. [Paradigm description] While the shift from static to continuously evolving policies is conceptually appealing, the manuscript provides no details on how the synthesis process integrates with the live serving system or manages the overhead of continuous observation and rewriting, which could itself impact performance.
minor comments (2)
  1. [Abstract] The abstract is quite dense with long sentences; splitting some for better flow would enhance readability.
  2. The term 'Autopoiesis' is used without explaining its connection to the biological concept or the specific analogy intended in this context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to provide greater clarity on experimental details, safety mechanisms, and integration overhead while preserving the core contributions.

Point-by-point responses
  1. Referee: [Abstract] The performance claims of up to 53% and average 34% improvements are presented without any supporting details on the experimental setup, including the state-of-the-art systems used as baselines, the specific runtime dynamics tested, workload characteristics, or quantitative evaluation data. This absence makes it impossible to determine whether the gains are due to the proposed self-evolution or other unstated factors.

    Authors: We agree the abstract is concise and omits setup specifics. The full manuscript details these in the Evaluation section, including baselines as state-of-the-art LLM serving systems, runtime dynamics such as workload fluctuations and elastic autoscaling, workload traces, and quantitative results with breakdowns. To make the abstract more self-contained, we will add a brief clause summarizing the evaluation conditions and confirming the gains are measured against those baselines under the described dynamics. revision: yes

  2. Referee: [LLM-driven program synthesis workflow] The description of the online policy evolution lacks any mention of safeguards such as code verification, sandboxing, bounded execution, or rollback mechanisms to handle potentially incorrect or unsafe code synthesized by the LLM. Since the central claim depends on the LLM reliably producing correct and efficient serving policies from real-time observations, this omission is load-bearing and risks the synthesized policies introducing bugs that negate the reported benefits.

    Authors: This is a substantive concern for any LLM-based code generation in production. The original manuscript emphasizes the synthesis workflow but does not explicitly cover reliability safeguards. In revision we will add a dedicated subsection describing: static verification and syntax checking of synthesized policies, sandboxed execution against synthetic workloads prior to live deployment, execution bounds on time and resources, and an automatic rollback to the prior policy version upon detected degradation or errors (see the sketch after this list for how such a guard could compose). We will also report supporting measurements showing these mechanisms maintain system stability. revision: yes

  3. Referee: [Paradigm description] While the shift from static to continuously evolving policies is conceptually appealing, the manuscript provides no details on how the synthesis process integrates with the live serving system or manages the overhead of continuous observation and rewriting, which could itself impact performance.

    Authors: We acknowledge the need for explicit integration and overhead analysis. The manuscript outlines the continuous workflow but we will expand the System Design and Implementation sections to describe non-intrusive metric collection via existing hooks, periodic or event-triggered synthesis, and atomic policy updates. We will add quantitative overhead results (synthesis and rewriting cost as a small fraction of serving time) with new figures demonstrating that net gains remain substantial after accounting for this cost. revision: yes
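
Neither the abstract nor the rebuttal spells out how the promised safeguards would fit together, so the following is a minimal, hypothetical Python sketch of a guard that applies static verification, a sandboxed replay, and rollback on degradation. Every name here (guarded_update, run_policy, replay_requests, the `policy` entry-point convention, the 5 percent regression margin) is an assumption made for illustration, not an interface or parameter described by the paper.

import ast

def guarded_update(candidate_src, current_policy, replay_requests, run_policy,
                   max_regression=0.05):
    """Hypothetical guard around each synthesized policy: verify, trial, roll back."""
    # Static verification: reject candidate code that does not even parse.
    try:
        tree = ast.parse(candidate_src)
    except SyntaxError:
        return current_policy

    # Materialize the candidate; a real system would execute this inside an
    # isolated sandbox with time and resource bounds.
    namespace = {}
    exec(compile(tree, "<candidate>", "exec"), namespace)
    candidate = namespace.get("policy")
    if candidate is None:
        return current_policy

    # Sandboxed trial: replay a recent or synthetic workload under both
    # the current policy and the candidate, comparing e.g. mean latency.
    baseline = run_policy(current_policy, replay_requests)
    trial = run_policy(candidate, replay_requests)

    # Rollback: keep the prior policy unless the candidate stays within
    # a small tolerated regression margin.
    if trial > baseline * (1.0 + max_regression):
        return current_policy
    return candidate

The interesting open question, which the rebuttal's overhead figures would need to answer, is whether this verification and replay cost stays a small fraction of serving time while the policy keeps evolving.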

Circularity Check

0 steps flagged

No circularity: empirical evaluation of a proposed system, not a derived prediction

Full rationale

The paper introduces Autopoiesis as an LLM-driven online policy synthesis system for LLM serving and reports measured improvements (up to 53%, average 34%) from direct evaluation on runtime dynamics. No equations, fitted parameters, or first-principles derivations appear in the abstract or described chain; the performance numbers are presented as experimental outcomes rather than quantities computed from the system's own inputs or prior self-citations. The central claim rests on the feasibility of LLM code synthesis, which is an engineering assumption subject to external verification, not a self-referential reduction. No load-bearing self-citation, ansatz smuggling, or renaming of known results is evident. The derivation chain is therefore self-contained as a system proposal plus empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that LLMs can perform reliable real-time program synthesis for serving policies; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: LLMs can synthesize correct and optimal serving policies from observed runtime dynamics
    This assumption underpins the entire LLM-driven synthesis workflow described in the abstract.
invented entities (1)
  • Autopoiesis self-evolving serving system · no independent evidence
    purpose: Continuous online policy evolution for LLM serving
    New system paradigm introduced to replace static policies.

pith-pipeline@v0.9.0 · 5595 in / 1198 out tokens · 34633 ms · 2026-05-10T17:24:45.015259+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 27 canonical work pages · 6 internal anchors

  1. [1]

    Burstgpt: A real-world workload dataset to optimize llm serving systems

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, et al. Burstgpt: A real-world workload dataset to optimize llm serving systems. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5831–5841, 2025

  2. [2]

    Llumnix: Dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 173–191, 2024

  3. [3]

    Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024

  4. [4]

    AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, 2023

  5. [5]

    Fairness in serving large language models

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E Gonzalez, and Ion Stoica. Fairness in serving large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 965–988, 2024

  6. [6]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

  7. [7]

    ServerlessLLM: Low-latency serverless inference for large language models

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153, 2024

  8. [8]

    Aegaeon: Effective gpu pooling for concurrent llm serving on the market

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1030–1045, 2025

  9. [9]

    Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

    Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 640–654, 2024

  10. [10]

    Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections

    Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 195–210, 2024

  11. [11]

    Spotserve: Serving generative large language models on preemptible instances

    Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1112–1127, 2024

  12. [12]

    Thunderserve: High-performance and cost-efficient llm serving in cloud environments

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025

  13. [13]

    Skyserve: Serving ai models across regions and clouds with spot instances

    Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of the Twentieth European Conference on Computer Systems, pages 159–175, 2025

  14. [14]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  15. [15]

    Orca: A distributed serving system for Transformer-Based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022

  16. [16]

    Efficient interactive llm serving with proxy model-based sequence length prediction

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Ba¸ sar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. arXiv preprint arXiv:2404.08509, 2024

  17. [17]

    Kunserve: Parameter-centric memory management for efficient memory overloading handling in llm serving

    Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, and Haibo Chen. Kunserve: Parameter-centric memory management for efficient memory overloading handling in llm serving. arXiv preprint arXiv:2412.18169, 2024

  18. [18]

    Mooncake: A kvcache-centric disaggregated architecture for llm serving

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage, 2024

  19. [19]

    Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment

    Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025

  20. [20]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

  21. [21]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  22. [22]

    Evox: Meta-evolution for automated discovery

    Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026

  23. [23]

    Openevolve: An open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: An open-source evolutionary coding agent, 2025

  24. [24]

    Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta, 2025

    Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta. arXiv preprint arXiv:2512.23236, 2025

  25. [25]

    Barbarians at the gate: How AI is upending systems research

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How ai is upending systems research. arXiv preprint arXiv:2510.06189, 2025

  26. [26]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  27. [27]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  28. [28]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  29. [29]

    Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pages 18332–18346. PMLR, 2022

  30. [30]

    Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150, 2025

  31. [31]

    Genetic programming as a means for programming computers by natural selection

    John R Koza. Genetic programming as a means for programming computers by natural selection. Statistics and computing, 4(2):87–112, 1994

  32. [32]

    Evolution of kernels: Automated risc-v kernel optimization with large language models

    Siyuan Chen, Zhichao Lu, and Qingfu Zhang. Evolution of kernels: Automated risc-v kernel optimization with large language models. arXiv preprint arXiv:2509.14265, 2025

  33. [33]

    Illuminating search spaces by mapping elites

    Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015

  34. [34]

    How SwissAI uses OpenTela for scalable LLM serving

    Xiaozhe Yao. How SwissAI uses OpenTela for scalable LLM serving. Xiaozhe Yao (Blog), March 2026. Accessed: 2026-03-16

  35. [35]

    Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql

    You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025

  36. [36]

    Distributed genetic algorithms for function optimization

    Reiko Tanese. Distributed genetic algorithms for function optimization. University of Michigan, 1989

  37. [37]

    Llm inference unveiled: Survey and roofline model insights

    Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363, 2024

  38. [38]

    Demystifying cost-efficiency in LLM serving over heterogeneous GPUs

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025

  39. [39]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  40. [40]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

  41. [41]

    Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider

    Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX annual technical conference (USENIX ATC 20), pages 205–218, 2020

  42. [42]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

  43. [43]

    Theory of linear and integer programming

    Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998

  44. [44]

    Efficient mixed-precision large language model inference with turbomind

    Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025

  45. [45]

    Thinking short and right over thinking long: Serving llm reasoning efficiently and accurately

    Yuhang Wang, Youhe Jiang, Bin Cui, and Fangcheng Fu. Thinking short and right over thinking long: Serving llm reasoning efficiently and accurately. arXiv preprint arXiv:2505.13326, 2025

  46. [46]

    Lmdeploy: A toolkit for compressing, deploying, and serving llm, 2023

    LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm, 2023

  47. [47]

    Cascadia: An efficient cascade serving system for large language models

    Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025

  48. [48]

    Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization

    Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026

  49. [49]

    OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

    Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026

  50. [50]

    Parallax: Efficient llm inference service over decentralized environment

    Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025

  51. [51]

    Efficient multi-round llm inference over disaggregated serving

    Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, and Fangcheng Fu. Efficient multi-round llm inference over disaggregated serving. arXiv preprint arXiv:2602.14516, 2026

  52. [52]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  53. [53]

    Fast distributed inference serving for large language models

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023

  54. [54]

    Galvatron: Efficient transformer training over multiple gpus using automatic parallelism

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878, 2022

  55. [55]

    Osdp: Optimal sharded data parallel for distributed deep learning

    Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258, 2022

  56. [56]

    Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus

    HE Guoliang, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  57. [57]

    Improving automatic parallel training via balanced memory workload optimization

    Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. Improving automatic parallel training via balanced memory workload optimization. IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024

  58. [58]

    Hexiscale: Accommodating large language model training over heterogeneous environment

    Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, and Binhang Yuan. Hexiscale: Accommodating large language model training over heterogeneous environment. arXiv preprint arXiv:2409.01143, 2024