Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
Autopoiesis uses LLMs to continuously synthesize and rewrite serving policies from real-time observations, replacing static human designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Serving policies are no longer static artifacts designed by humans before deployment but living code that LLMs continuously evolve throughout deployment to navigate runtime trade-offs beyond human design. This is achieved through an LLM-driven program synthesis workflow that observes real-world behavior and rewrites policy code as conditions change.
What carries the argument
LLM-driven program synthesis workflow that observes runtime dynamics and rewrites serving policy code continuously during operation.
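As a rough illustration of such a workflow (every name here is invented for exposition; the paper does not publish its interfaces), an observe-synthesize-deploy loop might look like the following, with a trivial rule standing in for the LLM call:

```python
from dataclasses import dataclass


@dataclass
class Observation:
    arrival_rate: float    # requests/s observed in the last window
    p99_latency_ms: float  # tail latency under the current policy


class PolicyStore:
    """Keeps every deployed policy version so an earlier one can be restored."""

    def __init__(self):
        self.versions = []

    def deploy(self, source: str):
        # compile() rejects syntactically invalid policy code before activation
        code = compile(source, f"policy_v{len(self.versions)}", "exec")
        namespace = {}
        exec(code, namespace)
        self.versions.append(namespace["schedule"])
        return self.versions[-1]


def synthesize_policy(obs: Observation) -> str:
    """Stand-in for the LLM call: emits policy source text conditioned on
    the latest observations. A hard-coded rule replaces the model here."""
    if obs.arrival_rate > 100:  # under heavy load, favor short requests
        return "def schedule(queue):\n    return sorted(queue, key=len)"
    return "def schedule(queue):\n    return list(queue)"  # otherwise FCFS


store = PolicyStore()
policy = store.deploy(synthesize_policy(Observation(150.0, 420.0)))
print(policy(["aaa", "a", "aa"]))  # -> ['a', 'aa', 'aaa']
```

The sketch captures only the control flow the abstract describes: observations feed a code generator, whose output is validated and swapped in as the live policy.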
If this is right
- Optimal scheduling and rescheduling decisions adapt automatically to workload-specific trade-offs that evolve over time.
- Policy maintenance shifts from periodic human redesign to continuous autonomous evolution integrated into the serving loop.
- System performance improves measurably, with gains of up to 53 percent and 34 percent on average across diverse runtime conditions.
Where Pith is reading between the lines
- The approach could reduce the engineering effort required to tune serving systems for new hardware or traffic patterns.
- Similar self-synthesis loops might apply to other dynamic resource managers facing intertwined overhead and efficiency choices.
- Long-term operation could accumulate a growing library of evolved policies that future instances reuse or refine.
Load-bearing premise
An LLM can reliably generate correct, efficient, and safe serving-policy code from real-time observations without introducing bugs, excessive overhead, or decisions that cancel out the performance gains.
What would settle it
A controlled run in which the LLM-synthesized policies produce higher latency, instability, or security violations than the original static baselines under identical workload fluctuations and autoscaling events.
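A toy version of that controlled comparison (synthetic trace, invented names, not the paper's harness) replays the identical workload under a static baseline and a candidate policy and compares waiting times:

```python
import random
import statistics


def mean_wait(service_times):
    """Mean waiting time when jobs run back to back in the given order."""
    elapsed, waits = 0.0, []
    for s in service_times:
        waits.append(elapsed)
        elapsed += s
    return statistics.mean(waits)


# Identical synthetic trace for both policies, as the settling test requires.
rng = random.Random(0)
trace = [rng.uniform(0.1, 5.0) for _ in range(1000)]

fcfs = mean_wait(trace)         # static first-come-first-served baseline
sjf = mean_wait(sorted(trace))  # a policy a synthesizer might discover

print(f"FCFS mean wait: {fcfs:.1f}s, SJF mean wait: {sjf:.1f}s")
```

Shortest-job-first provably minimizes mean waiting time on a single server, so it must win here; a synthesized policy that lost to the static baseline on the same trace would falsify the load-bearing premise.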
read the original abstract
Modern Large Language Model (LLM) serving operates in highly volatile environments characterized by severe runtime dynamics, such as workload fluctuations and elastic cluster autoscaling. Traditional serving systems rely on static, human-engineered serving policies (e.g., scheduling algorithms and rescheduling strategies) to manage these dynamics. However, these policies must navigate deeply intertwined runtime trade-offs (e.g., scheduling overhead vs. execution efficiency, rescheduling frequency vs. reconfiguration overhead), whose optimal balance is workload-specific and shifts continuously as runtime conditions evolve, rendering any fixed policy fundamentally unable to adapt. We propose Autopoiesis, a novel online self-evolving system that shifts LLM serving from static policy deployment to continuous online policy evolution. First, Autopoiesis introduces an LLM-driven program synthesis workflow to evolve serving policies with respect to real-time observed dynamics, where the evolved policies reflect the optimal decision in navigating the complex, multi-dimensional trade-off space. Second, Autopoiesis enables this synthesis process to operate continuously during serving, observing real-world system behavior, and rewriting the policy code as runtime trade-offs shift, thereby transforming policy design from a one-time offline endeavor into an ongoing system component, enabling autonomous adaptation to evolving runtime conditions. Together, we establish a new paradigm: Serving policies are no longer static artifacts designed by humans before deployment, but living code that LLMs continuously evolve throughout deployment to navigate runtime trade-offs beyond human design. We evaluate Autopoiesis across diverse runtime dynamics and show up to 53% and on average 34% improvements over state-of-the-art LLM serving systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Autopoiesis, a novel online self-evolving system for LLM serving that uses LLM-driven program synthesis to continuously evolve serving policies in response to runtime dynamics such as workload fluctuations and elastic autoscaling. It contrasts this with traditional static, human-engineered policies that cannot adapt to shifting trade-offs. The system enables ongoing policy rewriting during serving to better navigate multi-dimensional trade-offs. The evaluation is reported to show improvements of up to 53% and 34% on average over state-of-the-art LLM serving systems across diverse runtime dynamics.
Significance. If the self-evolving mechanism proves reliable, this work could mark a paradigm shift in LLM serving from static to adaptive, living policies, offering substantial efficiency gains in volatile production environments. The approach has the potential to reduce the need for manual policy tuning and improve system performance under changing conditions. However, the significance is currently limited by the lack of detailed evidence and mechanisms in the manuscript.
major comments (3)
- [Abstract] The performance claims of up to 53% and average 34% improvements are presented without any supporting details on the experimental setup, including the state-of-the-art systems used as baselines, the specific runtime dynamics tested, workload characteristics, or quantitative evaluation data. This absence makes it impossible to determine whether the gains are due to the proposed self-evolution or other unstated factors.
- [LLM-driven program synthesis workflow] The description of the online policy evolution lacks any mention of safeguards such as code verification, sandboxing, bounded execution, or rollback mechanisms to handle potentially incorrect or unsafe code synthesized by the LLM. Since the central claim depends on the LLM reliably producing correct and efficient serving policies from real-time observations, this omission is load-bearing and risks the synthesized policies introducing bugs that negate the reported benefits.
- [Paradigm description] While the shift from static to continuously evolving policies is conceptually appealing, the manuscript provides no details on how the synthesis process integrates with the live serving system or manages the overhead of continuous observation and rewriting, which could itself impact performance.
minor comments (2)
- [Abstract] The abstract is quite dense with long sentences; splitting some for better flow would enhance readability.
- The term 'Autopoiesis' is used without explaining its connection to the biological concept or the specific analogy intended in this context.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to provide greater clarity on experimental details, safety mechanisms, and integration overhead while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract] The performance claims of up to 53% and average 34% improvements are presented without any supporting details on the experimental setup, including the state-of-the-art systems used as baselines, the specific runtime dynamics tested, workload characteristics, or quantitative evaluation data. This absence makes it impossible to determine whether the gains are due to the proposed self-evolution or other unstated factors.
Authors: We agree the abstract is concise and omits setup specifics. The full manuscript details these in the Evaluation section, including baselines as state-of-the-art LLM serving systems, runtime dynamics such as workload fluctuations and elastic autoscaling, workload traces, and quantitative results with breakdowns. To make the abstract more self-contained, we will add a brief clause summarizing the evaluation conditions and confirming the gains are measured against those baselines under the described dynamics. revision: yes
-
Referee: [LLM-driven program synthesis workflow] The description of the online policy evolution lacks any mention of safeguards such as code verification, sandboxing, bounded execution, or rollback mechanisms to handle potentially incorrect or unsafe code synthesized by the LLM. Since the central claim depends on the LLM reliably producing correct and efficient serving policies from real-time observations, this omission is load-bearing and risks the synthesized policies introducing bugs that negate the reported benefits.
Authors: This is a substantive concern for any LLM-based code generation in production. The original manuscript emphasizes the synthesis workflow but does not explicitly cover reliability safeguards. In revision we will add a dedicated subsection describing: static verification and syntax checking of synthesized policies, sandboxed execution against synthetic workloads prior to live deployment, execution bounds on time and resources, and an automatic rollback to the prior policy version upon detected degradation or errors. We will also report supporting measurements showing these mechanisms maintain system stability. revision: yes
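A minimal sketch of the promised rollback safeguard (the threshold, names, and promotion protocol are illustrative assumptions, not the authors' design):

```python
class GuardedPolicy:
    """Keeps the last known-good policy and reverts automatically when the
    monitored metric regresses past a fixed budget. Illustrative only."""

    def __init__(self, baseline, p99_budget_ms=500.0):
        self.active = baseline
        self.known_good = baseline
        self.p99_budget_ms = p99_budget_ms

    def promote(self, candidate):
        # A candidate would reach this point only after static checks and
        # sandboxed trials; the current policy stays as the rollback target.
        self.known_good = self.active
        self.active = candidate

    def observe(self, measured_p99_ms):
        # Automatic rollback on detected degradation.
        if measured_p99_ms > self.p99_budget_ms:
            self.active = self.known_good
            return "rolled_back"
        return "ok"


guard = GuardedPolicy(baseline="fcfs_v1")
guard.promote("synthesized_v2")
print(guard.observe(800.0))  # -> rolled_back; fcfs_v1 is active again
```

Even this toy version shows why the safeguard matters: without a retained known-good version, a single bad synthesis could leave the system with no safe policy to fall back to.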
-
Referee: [Paradigm description] While the shift from static to continuously evolving policies is conceptually appealing, the manuscript provides no details on how the synthesis process integrates with the live serving system or manages the overhead of continuous observation and rewriting, which could itself impact performance.
Authors: We acknowledge the need for explicit integration and overhead analysis. The manuscript outlines the continuous workflow but we will expand the System Design and Implementation sections to describe non-intrusive metric collection via existing hooks, periodic or event-triggered synthesis, and atomic policy updates. We will add quantitative overhead results (synthesis and rewriting cost as a small fraction of serving time) with new figures demonstrating that net gains remain substantial after accounting for this cost. revision: yes
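The atomic-update claim can be sketched with a single guarded reference (names invented; the authors' implementation is not described): serving threads read one policy handle, so each request sees either the old policy or the new one in full, never a partial mix.

```python
import threading


class PolicyHandle:
    """Atomic policy swap via a single locked reference. Illustrative only."""

    def __init__(self, policy):
        self._lock = threading.Lock()
        self._policy = policy

    def get(self):
        # Readers take the current policy; in-flight requests that already
        # fetched the old one simply finish under it.
        with self._lock:
            return self._policy

    def swap(self, new_policy):
        with self._lock:
            self._policy = new_policy


handle = PolicyHandle(lambda q: list(q))       # initial FCFS policy
handle.swap(lambda q: sorted(q, key=len))      # event-triggered rewrite lands
print(handle.get()(["bbb", "b"]))  # -> ['b', 'bbb']
```

The swap itself is O(1), which is consistent with the authors' claim that rewriting cost can stay a small fraction of serving time; the expensive work (observation and synthesis) happens off the request path.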
Circularity Check
No circularity: empirical evaluation of a proposed system, not a derived prediction
full rationale
The paper introduces Autopoiesis as an LLM-driven online policy synthesis system for LLM serving and reports measured improvements (up to 53%, average 34%) from direct evaluation on runtime dynamics. No equations, fitted parameters, or first-principles derivations appear in the abstract or described chain; the performance numbers are presented as experimental outcomes rather than quantities computed from the system's own inputs or prior self-citations. The central claim rests on the feasibility of LLM code synthesis, which is an engineering assumption subject to external verification, not a self-referential reduction. No load-bearing self-citation, ansatz smuggling, or renaming of known results is evident. The derivation chain is therefore self-contained as a system proposal plus empirical results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can synthesize correct and optimal serving policies from observed runtime dynamics
invented entities (1)
-
Autopoiesis self-evolving serving system
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Burstgpt: A real-world workload dataset to optimize llm serving systems
Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, et al. Burstgpt: A real-world workload dataset to optimize llm serving systems. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5831–5841, 2025
2025
-
[2]
Llumnix: Dynamic scheduling for large language model serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, 2024
2024
-
[3]
Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, 2024
2024
-
[4]
AlpaServe: Statistical multiplexing with model parallelism for deep learning serving
Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, 2023
2023
-
[5]
Fairness in serving large language models
Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E Gonzalez, and Ion Stoica. Fairness in serving large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 965–988, 2024
2024
-
[6]
DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024
2024
-
[7]
ServerlessLLM: Low-Latency serverless inference for large language models
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-Latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153, 2024
2024
-
[8]
Aegaeon: Effective gpu pooling for concurrent llm serving on the market
Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1030–1045, 2025
2025
-
[9]
Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism
Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 640–654, 2024
2024
-
[10]
Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections
Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 195–210, 2024
2024
-
[11]
Spotserve: Serving generative large language models on preemptible instances
Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1112–1127, 2024
2024
-
[12]
Thunderserve: High-performance and cost-efficient llm serving in cloud environments
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025
2025
-
[13]
Skyserve: Serving ai models across regions and clouds with spot instances
Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of the Twentieth European Conference on Computer Systems, pages 159–175, 2025
2025
-
[14]
Efficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023
2023
-
[15]
Orca: A distributed serving system for Transformer-Based generative models
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022
2022
-
[16]
Efficient interactive llm serving with proxy model-based sequence length prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Başar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. arXiv preprint arXiv:2404.08509, 2024
-
[17]
Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, and Haibo Chen. Kunserve: Parameter-centric memory management for efficient memory overloading handling in llm serving. arXiv preprint arXiv:2412.18169, 2024
-
[18]
Mooncake: A kvcache-centric disaggregated architecture for llm serving
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage, 2024
2024
-
[19]
Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment
Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025
-
[20]
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024
2024
-
[21]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025
-
[22]
Evox: Meta-evolution for automated discovery
Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026
-
[23]
Openevolve: An open-source evolutionary coding agent, 2025
Asankhaya Sharma. Openevolve: An open-source evolutionary coding agent, 2025
2025
-
[24]
Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta, 2025
Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta. arXiv preprint arXiv:2512.23236, 2025
-
[25]
Barbarians at the gate: How AI is upending systems research
Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How ai is upending systems research. arXiv preprint arXiv:2510.06189, 2025
-
[26]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
-
[27]
Gpipe: Efficient training of giant neural networks using pipeline parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019
2019
-
[28]
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023
-
[29]
Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pages 18332–18346. PMLR, 2022
2022
-
[30]
Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150, 2025
-
[31]
Genetic programming as a means for programming computers by natural selection
John R Koza. Genetic programming as a means for programming computers by natural selection. Statistics and computing, 4(2):87–112, 1994
1994
-
[32]
Evolution of kernels: Automated risc-v kernel optimization with large language models
Siyuan Chen, Zhichao Lu, and Qingfu Zhang. Evolution of kernels: Automated risc-v kernel optimization with large language models. arXiv preprint arXiv:2509.14265, 2025
-
[33]
Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015
-
[34]
How SwissAI uses OpenTela for scalable LLM serving
Xiaozhe Yao. How SwissAI uses OpenTela for scalable LLM serving. Xiaozhe Yao (Blog), March 2026. Accessed: 2026-03-16
2026
-
[35]
Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql
You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025
-
[36]
Distributed genetic algorithms for function optimization
Reiko Tanese. Distributed genetic algorithms for function optimization. University of Michigan, 1989
1989
-
[37]
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363, 2024
-
[38]
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025
-
[39]
Judging llm-as-a-judge with mt-bench and chatbot arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023
2023
-
[40]
Longbench: A bilingual, multitask benchmark for long context understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024
2024
-
[41]
Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider
Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 205–218, 2020
2020
-
[42]
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025
-
[43]
Theory of linear and integer programming
Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998
1998
-
[44]
Efficient mixed- precision large language model inference with turbomind
Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed- precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025
-
[45]
Yuhang Wang, Youhe Jiang, Bin Cui, and Fangcheng Fu. Thinking short and right over thinking long: Serving llm reasoning efficiently and accurately. arXiv preprint arXiv:2505.13326, 2025
-
[46]
Lmdeploy: A toolkit for compressing, deploying, and serving llm, 2023
LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm, 2023
2023
-
[47]
Cascadia: An efficient cascade serving system for large language models
Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025
-
[48]
Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026
-
[49]
OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026
-
[50]
Parallax: Efficient llm inference service over decentralized environment
Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025
-
[51]
Efficient multi-round llm inference over disaggregated serving
Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, and Fangcheng Fu. Efficient multi-round llm inference over disaggregated serving. arXiv preprint arXiv:2602.14516, 2026
-
[52]
Splitwise: Efficient generative llm inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024
2024
-
[53]
Fast distributed inference serving for large language models
Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023
-
[54]
Galvatron: Efficient transformer training over multiple gpus using automatic parallelism
Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878, 2022
-
[55]
Osdp: Optimal sharded data parallel for distributed deep learning
Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258, 2022
-
[56]
Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus
Guoliang He, Youhe Jiang, Wencong Xiao, Kaihua Jiang, Shuguang Wang, Jun Wang, Zixian Du, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus. In The Thirty-ninth Annual Conference on Neural Information Processing Systems
-
[57]
Improving automatic parallel training via balanced memory workload optimization
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. Improving automatic parallel training via balanced memory workload optimization. IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024
2024
-
[58]
Hexiscale: Accommodating large language model training over heterogeneous environment
Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, and Binhang Yuan. Hexiscale: Accommodating large language model training over heterogeneous environment. arXiv preprint arXiv:2409.01143, 2024