pith · machine review for the scientific record

arXiv: 2604.07144 · v1 · submitted 2026-04-08 · 💻 cs.DC

Recognition: unknown

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

Binhang Yuan, Fangcheng Fu, Ran Yan, Taiyi Wang, Wenshuang Li, Youhe Jiang, You Peng

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM serving · self-evolving systems · program synthesis · runtime dynamics · policy adaptation · scheduling · autoscaling

The pith

Autopoiesis uses LLMs to continuously synthesize and rewrite serving policies from real-time observations, replacing static human designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving systems operate in volatile settings where workloads fluctuate and clusters scale elastically, so fixed scheduling and rescheduling policies cannot track the shifting trade-off between overhead and efficiency. The paper introduces Autopoiesis, an online self-evolving framework in which an LLM observes actual system behavior and generates updated policy code on the fly. This turns policy design from a one-time offline task into a persistent runtime component that adapts without further human input. The evaluation across varied runtime dynamics reports gains of up to 53 percent, and 34 percent on average, over state-of-the-art static serving systems.

Core claim

Serving policies are no longer static artifacts designed by humans before deployment, but living code that LLMs continuously evolve throughout deployment to navigate runtime trade-offs beyond human design. This is achieved through an LLM-driven program synthesis workflow that observes real-world behavior and rewrites policy code as conditions change.

What carries the argument

LLM-driven program synthesis workflow that observes runtime dynamics and rewrites serving policy code continuously during operation.
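
Taken at face value, this describes a closed loop: observe the serving system, ask an LLM to rewrite the policy, validate the result, and redeploy it. The Python sketch below is a minimal, hypothetical rendering of such a loop; all of the callables (collect_metrics, synthesize_policy, validate, deploy) and the five-minute interval are placeholders, since the abstract exposes no concrete interfaces.

import time

def autopoiesis_loop(collect_metrics, synthesize_policy, validate, deploy,
                     interval_s=300):
    """Hypothetical outer loop for continuous policy evolution.

    Each callable stands in for a component the paper only names at a
    high level: runtime metric collection, LLM-driven policy synthesis,
    pre-deployment validation, and an atomic policy swap.
    """
    current_policy = None
    while True:
        observations = collect_metrics()          # e.g. request rates, queue depths, autoscaling events
        candidate = synthesize_policy(observations, current_policy)  # LLM rewrites the policy code
        if validate(candidate, observations):     # reject candidates that fail checks
            deploy(candidate)                     # atomically swap in the new policy
            current_policy = candidate
        time.sleep(interval_s)                    # or trigger synthesis on detected workload shifts instead

Whether synthesis runs on a fixed period or is triggered by detected shifts in runtime conditions is exactly the kind of design detail the referee report below asks the authors to pin down.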

If this is right

  • Optimal scheduling and rescheduling decisions adapt automatically to workload-specific trade-offs that evolve over time.
  • Policy maintenance shifts from periodic human redesign to continuous autonomous evolution integrated into the serving loop.
  • System performance improves measurably, reaching up to 53 percent gains and 34 percent on average across diverse runtime conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the engineering effort required to tune serving systems for new hardware or traffic patterns.
  • Similar self-synthesis loops might apply to other dynamic resource managers facing intertwined overhead and efficiency choices.
  • Long-term operation could accumulate a growing library of evolved policies that future instances reuse or refine.

Load-bearing premise

An LLM can reliably generate correct, efficient, and safe serving-policy code from real-time observations without introducing bugs, excessive overhead, or decisions that cancel out the performance gains.

What would settle it

A controlled run in which the LLM-synthesized policies produce higher latency, greater instability, or more security violations than the original static baselines under identical workload fluctuations and autoscaling events.

Original abstract

Modern Large Language Model (LLM) serving operates in highly volatile environments characterized by severe runtime dynamics, such as workload fluctuations and elastic cluster autoscaling. Traditional serving systems rely on static, human-engineered serving policies (e.g., scheduling algorithms and rescheduling strategies) to manage these dynamics. However, these policies must navigate deeply intertwined runtime trade-offs (e.g., scheduling overhead vs. execution efficiency, rescheduling frequency vs. reconfiguration overhead), whose optimal balance is workload-specific and shifts continuously as runtime conditions evolve, rendering any fixed policy fundamentally unable to adapt. We propose Autopoiesis, a novel online self-evolving system that shifts LLM serving from static policy deployment to continuous online policy evolution. First, Autopoiesis introduces an LLM-driven program synthesis workflow to evolve serving policies with respect to real-time observed dynamics, where the evolved policies reflect the optimal decision in navigating the complex, multi-dimensional trade-off space. Second, Autopoiesis enables this synthesis process to operate continuously during serving, observing real-world system behavior, and rewriting the policy code as runtime trade-offs shift, thereby transforming policy design from a one-time offline endeavor into an ongoing system component, enabling autonomous adaptation to evolving runtime conditions. Together, we establish a new paradigm: Serving policies are no longer static artifacts designed by humans before deployment, but living code that LLMs continuously evolve throughout deployment to navigate runtime trade-offs beyond human design. We evaluate Autopoiesis across diverse runtime dynamics and show up to 53% and on average 34% improvements over state-of-the-art LLM serving systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Autopoiesis, a novel online self-evolving system for LLM serving that uses LLM-driven program synthesis to continuously evolve serving policies in response to runtime dynamics such as workload fluctuations and elastic autoscaling. It contrasts this with traditional static, human-engineered policies that cannot adapt to shifting trade-offs. The system enables ongoing policy rewriting during serving to achieve better navigation of multi-dimensional trade-offs. Evaluations are claimed to demonstrate up to 53% and on average 34% improvements over state-of-the-art LLM serving systems across diverse runtime dynamics.

Significance. If the self-evolving mechanism proves reliable, this work could mark a paradigm shift in LLM serving from static to adaptive, living policies, offering substantial efficiency gains in volatile production environments. The approach has the potential to reduce the need for manual policy tuning and improve system performance under changing conditions. However, the significance is currently limited by the lack of detailed evidence and mechanisms in the manuscript.

major comments (3)
  1. [Abstract] The performance claims of up to 53% and average 34% improvements are presented without any supporting details on the experimental setup, including the state-of-the-art systems used as baselines, the specific runtime dynamics tested, workload characteristics, or quantitative evaluation data. This absence makes it impossible to determine whether the gains are due to the proposed self-evolution or other unstated factors.
  2. [LLM-driven program synthesis workflow] The description of the online policy evolution lacks any mention of safeguards such as code verification, sandboxing, bounded execution, or rollback mechanisms to handle potentially incorrect or unsafe code synthesized by the LLM. Since the central claim depends on the LLM reliably producing correct and efficient serving policies from real-time observations, this omission is load-bearing and risks the synthesized policies introducing bugs that negate the reported benefits.
  3. [Paradigm description] While the shift from static to continuously evolving policies is conceptually appealing, the manuscript provides no details on how the synthesis process integrates with the live serving system or manages the overhead of continuous observation and rewriting, which could itself impact performance.
minor comments (2)
  1. [Abstract] The abstract is quite dense with long sentences; splitting some for better flow would enhance readability.
  2. The term 'Autopoiesis' is used without explaining its connection to the biological concept or the specific analogy intended in this context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to provide greater clarity on experimental details, safety mechanisms, and integration overhead while preserving the core contributions.

Point-by-point responses
  1. Referee: [Abstract] The performance claims of up to 53% and average 34% improvements are presented without any supporting details on the experimental setup, including the state-of-the-art systems used as baselines, the specific runtime dynamics tested, workload characteristics, or quantitative evaluation data. This absence makes it impossible to determine whether the gains are due to the proposed self-evolution or other unstated factors.

    Authors: We agree the abstract is concise and omits setup specifics. The full manuscript details these in the Evaluation section, including baselines as state-of-the-art LLM serving systems, runtime dynamics such as workload fluctuations and elastic autoscaling, workload traces, and quantitative results with breakdowns. To make the abstract more self-contained, we will add a brief clause summarizing the evaluation conditions and confirming the gains are measured against those baselines under the described dynamics. revision: yes

  2. Referee: [LLM-driven program synthesis workflow] The description of the online policy evolution lacks any mention of safeguards such as code verification, sandboxing, bounded execution, or rollback mechanisms to handle potentially incorrect or unsafe code synthesized by the LLM. Since the central claim depends on the LLM reliably producing correct and efficient serving policies from real-time observations, this omission is load-bearing and risks the synthesized policies introducing bugs that negate the reported benefits.

    Authors: This is a substantive concern for any LLM-based code generation in production. The original manuscript emphasizes the synthesis workflow but does not explicitly cover reliability safeguards. In revision we will add a dedicated subsection describing: static verification and syntax checking of synthesized policies, sandboxed execution against synthetic workloads prior to live deployment, execution bounds on time and resources, and an automatic rollback to the prior policy version upon detected degradation or errors (see the sketch after this list for how such a guard could compose). We will also report supporting measurements showing these mechanisms maintain system stability. revision: yes

  3. Referee: [Paradigm description] While the shift from static to continuously evolving policies is conceptually appealing, the manuscript provides no details on how the synthesis process integrates with the live serving system or manages the overhead of continuous observation and rewriting, which could itself impact performance.

    Authors: We acknowledge the need for explicit integration and overhead analysis. The manuscript outlines the continuous workflow but we will expand the System Design and Implementation sections to describe non-intrusive metric collection via existing hooks, periodic or event-triggered synthesis, and atomic policy updates. We will add quantitative overhead results (synthesis and rewriting cost as a small fraction of serving time) with new figures demonstrating that net gains remain substantial after accounting for this cost. revision: yes
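
Neither the abstract nor the rebuttal spells out how the promised safeguards would fit together, so the following is a minimal, hypothetical Python sketch of a guard that applies static verification, a sandboxed replay, and rollback on degradation. Every name here (guarded_update, run_policy, replay_requests, the `policy` entry-point convention, the 5 percent regression margin) is an assumption made for illustration, not an interface or parameter described by the paper.

import ast

def guarded_update(candidate_src, current_policy, replay_requests, run_policy,
                   max_regression=0.05):
    """Hypothetical guard around each synthesized policy: verify, trial, roll back."""
    # Static verification: reject candidate code that does not even parse.
    try:
        tree = ast.parse(candidate_src)
    except SyntaxError:
        return current_policy

    # Materialize the candidate; a real system would execute this inside an
    # isolated sandbox with time and resource bounds.
    namespace = {}
    exec(compile(tree, "<candidate>", "exec"), namespace)
    candidate = namespace.get("policy")
    if candidate is None:
        return current_policy

    # Sandboxed trial: replay a recent or synthetic workload under both
    # the current policy and the candidate, comparing e.g. mean latency.
    baseline = run_policy(current_policy, replay_requests)
    trial = run_policy(candidate, replay_requests)

    # Rollback: keep the prior policy unless the candidate stays within
    # a small tolerated regression margin.
    if trial > baseline * (1.0 + max_regression):
        return current_policy
    return candidate

The interesting open question, which the rebuttal's overhead figures would need to answer, is whether this verification and replay cost stays a small fraction of serving time while the policy keeps evolving.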

Circularity Check

0 steps flagged

No circularity: empirical evaluation of a proposed system, not a derived prediction

Full rationale

The paper introduces Autopoiesis as an LLM-driven online policy synthesis system for LLM serving and reports measured improvements (up to 53%, average 34%) from direct evaluation on runtime dynamics. No equations, fitted parameters, or first-principles derivations appear in the abstract or described chain; the performance numbers are presented as experimental outcomes rather than quantities computed from the system's own inputs or prior self-citations. The central claim rests on the feasibility of LLM code synthesis, which is an engineering assumption subject to external verification, not a self-referential reduction. No load-bearing self-citation, ansatz smuggling, or renaming of known results is evident. The derivation chain is therefore self-contained as a system proposal plus empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that LLMs can perform reliable real-time program synthesis for serving policies; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: LLMs can synthesize correct and optimal serving policies from observed runtime dynamics
    This assumption underpins the entire LLM-driven synthesis workflow described in the abstract.
invented entities (1)
  • Autopoiesis self-evolving serving system · no independent evidence
    purpose: Continuous online policy evolution for LLM serving
    New system paradigm introduced to replace static policies.

pith-pipeline@v0.9.0 · 5595 in / 1198 out tokens · 34633 ms · 2026-05-10T17:24:45.015259+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 27 canonical work pages · 6 internal anchors

  1. [1]

    Burstgpt: A real-world workload dataset to optimize llm serving systems

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, et al. Burstgpt: A real-world workload dataset to optimize llm serving systems. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5831–5841, 2025

  2. [2]

    Llumnix: Dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 173–191, 2024

  3. [3]

    Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024

  4. [4]

    AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, 2023

  5. [5]

    Fairness in serving large language models

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E Gonzalez, and Ion Stoica. Fairness in serving large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 965–988, 2024

  6. [6]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

  7. [7]

    ServerlessLLM: Low-latency serverless inference for large language models

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153, 2024

  8. [8]

    Aegaeon: Effective gpu pooling for concurrent llm serving on the market

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1030–1045, 2025

  9. [9]

    Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

    Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 640–654, 2024

  10. [10]

    Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections

    Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 195–210, 2024

  11. [11]

    Spotserve: Serving generative large language models on preemptible instances

    Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1112–1127, 2024

  12. [12]

    Thunderserve: High-performance and cost-efficient llm serving in cloud environments

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025

  13. [13]

    Skyserve: Serving ai models across regions and clouds with spot instances

    Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of the Twentieth European Conference on Computer Systems, pages 159–175, 2025

  14. [14]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  15. [15]

    Orca: A distributed serving system for Transformer-Based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022

  16. [16]

    Efficient interactive llm serving with proxy model-based sequence length prediction

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Ba¸ sar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. arXiv preprint arXiv:2404.08509, 2024

  17. [17]

    Kunserve: Parameter-centric memory management for efficient memory overloading handling in llm serving

    Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, and Haibo Chen. Kunserve: Parameter-centric memory management for efficient memory overloading handling in llm serving. arXiv preprint arXiv:2412.18169, 2024

  18. [18]

    Mooncake: A kvcache-centric disaggregated architecture for llm serving

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage, 2024

  19. [19]

    Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment

    Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025

  20. [20]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

  21. [21]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  22. [22]

    Evox: Meta-evolution for automated discovery

    Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026

  23. [23]

    Openevolve: An open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: An open-source evolutionary coding agent, 2025

  24. [24]

    Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta, 2025

    Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta. arXiv preprint arXiv:2512.23236, 2025

  25. [25]

    Barbarians at the gate: How AI is upending systems research

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How ai is upending systems research. arXiv preprint arXiv:2510.06189, 2025

  26. [26]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  27. [27]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  28. [28]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  29. [29]

    Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pages 18332–18346. PMLR, 2022

  30. [30]

    Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150, 2025

  31. [31]

    Genetic programming as a means for programming computers by natural selection

    John R Koza. Genetic programming as a means for programming computers by natural selection. Statistics and computing, 4(2):87–112, 1994

  32. [32]

    Evolution of kernels: Automated risc-v kernel optimization with large language models

    Siyuan Chen, Zhichao Lu, and Qingfu Zhang. Evolution of kernels: Automated risc-v kernel optimization with large language models. arXiv preprint arXiv:2509.14265, 2025

  33. [33]

    Illuminating search spaces by mapping elites

    Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015

  34. [34]

    How SwissAI uses OpenTela for scalable LLM serving

    Xiaozhe Yao. How SwissAI uses OpenTela for scalable LLM serving. Xiaozhe Yao (Blog), March 2026. Accessed: 2026-03-16

  35. [35]

    Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql

    You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025

  36. [36]

    Distributed genetic algorithms for function optimization

    Reiko Tanese. Distributed genetic algorithms for function optimization. University of Michigan, 1989

  37. [37]

    Llm inference unveiled: Survey and roofline model insights

    Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363, 2024

  38. [38]

    Demystifying cost-efficiency in LLM serving over heterogeneous GPUs

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025

  39. [39]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  40. [40]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

  41. [41]

    Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider

    Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX annual technical conference (USENIX ATC 20), pages 205–218, 2020

  42. [42]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

  43. [43]

    Theory of linear and integer programming

    Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998

  44. [44]

    Efficient mixed-precision large language model inference with turbomind

    Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025

  45. [45]

    Thinking short and right over thinking long: Serving llm reasoning efficiently and accurately

    Yuhang Wang, Youhe Jiang, Bin Cui, and Fangcheng Fu. Thinking short and right over thinking long: Serving llm reasoning efficiently and accurately. arXiv preprint arXiv:2505.13326, 2025

  46. [46]

    Lmdeploy: A toolkit for compressing, deploying, and serving llm, 2023

    LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm, 2023

  47. [47]

    Cascadia: An efficient cascade serving system for large language models

    Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025

  48. [48]

    Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization

    Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026

  49. [49]

    OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

    Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026

  50. [50]

    Parallax: Efficient llm inference service over decentralized environment

    Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025

  51. [51]

    Efficient multi-round llm inference over disaggregated serving

    Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, and Fangcheng Fu. Efficient multi-round llm inference over disaggregated serving. arXiv preprint arXiv:2602.14516, 2026

  52. [52]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  53. [53]

    Fast distributed inference serving for large language models

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023

  54. [54]

    Galvatron: Efficient transformer training over multiple gpus using automatic parallelism

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878, 2022

  55. [55]

    Osdp: Optimal sharded data parallel for distributed deep learning

    Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258, 2022

  56. [56]

    Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus

    HE Guoliang, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  57. [57]

    Improving automatic parallel training via balanced memory workload optimization

    Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. Improving automatic parallel training via balanced memory workload optimization. IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024

  58. [58]

    Hexiscale: Accommodating large language model training over heterogeneous environment

    Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, and Binhang Yuan. Hexiscale: Accommodating large language model training over heterogeneous environment. arXiv preprint arXiv:2409.01143, 2024