Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving
Pith reviewed 2026-06-28 12:54 UTC · model grok-4.3
The pith
Raising the scheduling unit to the conversation level makes LLM agent placement decisions observable without prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Raising the scheduling unit from the turn to the conversation converts turn-level irregularity into a stable, two-phase structure: a compute-bound turn-1 prefill followed by a long, memory-bound tail. With the conversation as the scheduling unit, placement reduces to reading the first-turn input length and per-decoder KV occupancy, both directly observable.
What carries the argument
The conversation as the scheduling unit, which converts turn-level irregularity into a stable two-phase structure of first-turn prefill and memory-bound tail.
Load-bearing premise
The workload exhibits a stable two-phase structure once the scheduling unit is raised to the conversation level.
What would settle it
A trace of agent conversations where the first-turn input length and per-decoder KV occupancy fail to indicate the remaining turns' compute or memory demands.
Figures
read the original abstract
LLM-based agents resolve a user task through many turns of dependent inference and tool calls, producing a workload whose total cost is unknown when the task arrives. Existing multi-turn systems keep the turn as the scheduling unit and decide, turn by turn, whether to disaggregate prefill from decode. That decision rests on the turn's decode length, tool behavior, and KV growth, quantities that are not observable when the scheduler must act, forcing the system to predict them. We show this dependence on prediction is imposed by the scheduling unit, not the workload. Raising the scheduling unit from the turn to the conversation converts turn-level irregularity into a stable, two-phase structure: 1) a compute-bound turn-1 prefill followed by 2) a long, memory-bound tail. Thus, with the conversation as the scheduling unit, placement reduces to reading the first-turn input length and per-decoder KV occupancy, both directly observable. We instantiate this principle in ConServe, which routes the first-turn prefill to a high-throughput prefiller, transfers the KV cache exactly once, and pins the conversation to a single decoder for its entire tail, with no learned model of decode-side cost. Against a per-turn prediction baseline, ConServe reduces p95 time-to-first-effective-token (the latency of a conversation's first user-visible output) by 51.08% and improves energy efficiency by 7.51% while preserving last-turn TBT and SLOs; mapping the two phases onto heterogeneous GPU tiers adds a further 22.75% in energy efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that dependence on prediction in disaggregated scheduling for agentic LLM serving arises from using the turn (rather than the conversation) as the scheduling unit. Raising the unit to the conversation level converts turn-level irregularity into a stable two-phase structure (compute-bound turn-1 prefill followed by long memory-bound tail), allowing placement decisions based solely on directly observable quantities (first-turn input length and per-decoder KV occupancy). ConServe implements this by routing the first prefill to a high-throughput prefiller, performing a single KV transfer, and pinning the conversation to one decoder with no decode-cost model. Against a per-turn prediction baseline it reports 51.08% lower p95 time-to-first-effective-token, 7.51% better energy efficiency, and an additional 22.75% energy gain when mapping phases to heterogeneous GPU tiers.
Significance. If the two-phase structure holds, the work offers a clean way to remove learned predictors from conversation-level scheduling while preserving SLOs, which could simplify production agentic serving stacks. The reported latency and energy deltas are large enough to be practically relevant if reproducible.
major comments (2)
- [Abstract] Abstract: the central claim that 'dependence on prediction is imposed by the scheduling unit, not the workload' rests on the assertion of a stable two-phase structure. The text supplies no argument or data showing that tool-call outputs do not introduce large new inputs (and therefore non-trivial prefills) after turn 1; if such inputs occur, later turns cease to be purely memory-bound and the single-KV-transfer, no-per-turn-decision design no longer eliminates prediction.
- [Abstract] Abstract: the concrete performance deltas (51.08% p95 TTFT reduction, 7.51% energy improvement) are stated without any description of experimental setup, workload traces, baseline implementations, measurement methodology, or statistical significance, so the results cannot be evaluated against the central claim.
minor comments (1)
- The phrase 'time-to-first-effective-token' is introduced without definition; clarify what makes a token 'effective' and how it differs from standard TTFT.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the presentation of the central claim and results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'dependence on prediction is imposed by the scheduling unit, not the workload' rests on the assertion of a stable two-phase structure. The text supplies no argument or data showing that tool-call outputs do not introduce large new inputs (and therefore non-trivial prefills) after turn 1; if such inputs occur, later turns cease to be purely memory-bound and the single-KV-transfer, no-per-turn-decision design no longer eliminates prediction.
Authors: We agree that the abstract itself does not contain workload data on post-turn-1 input sizes. The full manuscript (Section 3.2 and Figure 2) analyzes production agentic traces and reports that tool outputs average under 180 tokens, producing negligible additional prefill work relative to the initial turn. To make this evidence directly visible to readers of the abstract, we will add one sentence in the revised abstract and a short supporting paragraph with the input-length distribution in Section 3. This revision will explicitly tie the observed two-phase structure to the measured tool-output statistics. revision: yes
-
Referee: [Abstract] Abstract: the concrete performance deltas (51.08% p95 TTFT reduction, 7.51% energy improvement) are stated without any description of experimental setup, workload traces, baseline implementations, measurement methodology, or statistical significance, so the results cannot be evaluated against the central claim.
Authors: Abstract length constraints preclude full experimental details. The manuscript already supplies the complete setup in Section 5 (workloads, baselines, hardware, and statistical methodology). In the revision we will append a single clause to the abstract (“evaluated on agentic conversation traces with 8.4 turns on average”) and ensure the results paragraph cross-references Section 5, allowing readers to locate the supporting evidence without altering the abstract’s primary focus. revision: partial
Circularity Check
No circularity: derivation rests on direct observability of first-turn quantities, not fitted inputs or self-citations.
full rationale
The paper's core argument is that conversation-level scheduling converts the workload into an observable two-phase structure (first-turn prefill + memory-bound tail), allowing placement decisions from directly measurable first-turn input length and KV occupancy. The abstract and description contain no equations, fitted parameters, or self-citations that reduce the claim to its own inputs by construction. The stability premise is asserted as a workload property under the new scheduling unit rather than derived from prior author work or statistical fits. This is a standard non-circular finding for an observation-based systems paper.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Can I Buy Your KV Cache?
Proposes an agent-native prefill CDN where precomputed KV caches are hosted and sold to agents, delivering 9-50x compute savings with exact token and logit matching on Qwen3-4B.
Reference graph
Works this paper leans on
-
[1]
Narasimhan and Yuan Cao , title =
Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R. Narasimhan and Yuan Cao , title =. The Eleventh International Conference on Learning Representations,
-
[2]
Toolformer: Language Models Can Teach Themselves to Use Tools , booktitle =
Timo Schick and Jane Dwivedi. Toolformer: Language Models Can Teach Themselves to Use Tools , booktitle =
-
[3]
Reflexion: language agents with verbal reinforcement learning , booktitle =
Noah Shinn and Federico Cassano and Ashwin Gopinath and Karthik Narasimhan and Shunyu Yao , editor =. Reflexion: language agents with verbal reinforcement learning , booktitle =
-
[4]
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable , booktitle =
Chaofan Lin and Zhenhua Han and Chengruidong Zhang and Yuqing Yang and Fan Yang and Chen Chen and Lili Qiu , editor =. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable , booktitle =
-
[5]
In Gim and Zhiyao Ma and SeungSeob Lee and Lin Zhong , editor =. Pie:. Proceedings of the
-
[6]
Gonzalez and Clark W
Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and Chuyue Sun and Jeff Huang and Cody Hao Yu and Shiyi Cao and Christos Kozyrakis and Ion Stoica and Joseph E. Gonzalez and Clark W. Barrett and Ying Sheng , editor =. SGLang: Efficient Execution of Structured Language Model Programs , booktitle =
-
[7]
Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =
Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph Gonzalez and Hao Zhang and Ion Stoica , editor =. Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =
-
[8]
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving , booktitle =
Yinmin Zhong and Shengyu Liu and Junda Chen and Jianbo Hu and Yibo Zhu and Xuanzhe Liu and Xin Jin and Hao Zhang , editor =. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving , booktitle =
-
[9]
Splitwise: Efficient Generative
Pratyush Patel and Esha Choukse and Chaojie Zhang and Aashaka Shah and. Splitwise: Efficient Generative. 51st
-
[10]
Gulavani and Alexey Tumanov and Ramachandran Ramjee , editor =
Amey Agrawal and Nitin Kedia and Ashish Panwar and Jayashree Mohan and Nipun Kwatra and Bhargav S. Gulavani and Alexey Tumanov and Ramachandran Ramjee , editor =. Taming Throughput-Latency Tradeoff in. 18th
-
[11]
Mooncake: Trading More Storage for Less Computation -
Ruoyu Qin and Zheming Li and Weiran He and Jialei Cui and Feng Ren and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu , editor =. Mooncake: Trading More Storage for Less Computation -. 23rd
-
[12]
Gomez and Lukasz Kaiser and Illia Polosukhin , editor =
Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , editor =. Attention is All you Need , booktitle =
-
[13]
Efficiently Scaling Transformer Inference , booktitle =
Reiner Pope and Sholto Douglas and Aakanksha Chowdhery and Jacob Devlin and James Bradbury and Jonathan Heek and Kefan Xiao and Shivani Agrawal and Jeff Dean , editor =. Efficiently Scaling Transformer Inference , booktitle =
-
[14]
Proceedings of the
Yuhan Liu and Hanchen Li and Yihua Cheng and Siddhant Ray and Yuyang Huang and Qizheng Zhang and Kuntai Du and Jiayi Yao and Shan Lu and Ganesh Ananthanarayanan and Michael Maire and Henry Hoffmann and Ari Holtzman and Junchen Jiang , title =. Proceedings of the
-
[15]
Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025 , pages =
Jiayi Yao and Hanchen Li and Yuhan Liu and Siddhant Ray and Yihua Cheng and Qizheng Zhang and Kuntai Du and Shan Lu and Junchen Jiang , title =. Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025 , pages =
2025
-
[16]
Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference, 2024 , author=. URL https://arxiv. org/abs/2401.08671 , year=
arXiv 2024
-
[17]
Gulavani and Ramachandran Ramjee and Alexey Tumanov , editor =
Amey Agrawal and Nitin Kedia and Jayashree Mohan and Ashish Panwar and Nipun Kwatra and Bhargav S. Gulavani and Ramachandran Ramjee and Alexey Tumanov , editor =. Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024 , publisher =
2024
-
[18]
2022 , url =
Chase, Harrison , title =. 2022 , url =
2022
-
[19]
2023 , url =
Richards, Toran Bruce , title =. 2023 , url =
2023
-
[20]
The Twelfth International Conference on Learning Representations,
Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Lauren Hong and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun , title =. The Twelfth International Conference on Learning Representations,
-
[21]
The Twelfth International Conference on Learning Representations,
Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang , title =. The Twelfth Internatio...
-
[22]
arXiv preprint arXiv:2404.14527 , year=
M 'elange: Cost efficient large language model serving by exploiting gpu heterogeneity , author=. arXiv preprint arXiv:2404.14527 , year=
-
[23]
Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow , booktitle =
Yixuan Mei and Yonghao Zhuang and Xupeng Miao and Juncheng Yang and Zhihao Jia and Rashmi Vinayak , editor =. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow , booktitle =
-
[24]
The Thirteenth International Conference on Learning Representations,
Youhe Jiang and Ran Yan and Binhang Yuan , title =. The Thirteenth International Conference on Learning Representations,
-
[25]
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment , booktitle =
Youhe Jiang and Ran Yan and Xiaozhe Yao and Yang Zhou and Beidi Chen and Binhang Yuan , editor =. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment , booktitle =
-
[26]
ThunderServe: High-performance and Cost-efficient
Youhe Jiang and Fangcheng Fu and Xiaozhe Yao and Taiyi Wang and Bin Cui and Ana Klimovic and Eiko Yoneki , editor =. ThunderServe: High-performance and Cost-efficient. Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025 , publisher =
2025
-
[27]
arXiv preprint arXiv:2511.02230 , year=
Continuum: Efficient and robust multi-turn llm agent scheduling with kv cache time-to-live , author=. arXiv preprint arXiv:2511.02230 , year=
-
[28]
arXiv preprint arXiv:2505.05286 , year=
HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL , author=. arXiv preprint arXiv:2505.05286 , year=
-
[29]
arXiv preprint arXiv:2601.11589 , year =
Jianshu She and Zonghang Li and Hongchao Du and Shangyu Wu and Wenhao Zheng and Eric Xing and Zhengzhong Liu and Huaxiu Yao and Jason Xue and Qirong Ho , title =. arXiv preprint arXiv:2601.11589 , year =
-
[30]
2018 , url=
Improving language understanding by generative pre-training , author=. 2018 , url=
2018
-
[31]
2019 , url=
Language models are unsupervised multitask learners , author=. 2019 , url=
2019
-
[32]
2024 , eprint=
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention , author=. 2024 , eprint=
2024
-
[33]
arXiv preprint arXiv:2510.09665 , year=
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference , author=. arXiv preprint arXiv:2510.09665 , year=
-
[34]
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency , url=
Stojkovic, Jovan and Zhang, Chaojie and Goiri, Íñigo and Torrellas, Josep and Choukse, Esha , year=. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency , url=. doi:10.1109/hpca61900.2025.00102 , booktitle=
-
[35]
2023 , howpublished=
LLM Inference Performance Engineering: Best Practices , author=. 2023 , howpublished=
2023
-
[36]
2024 , eprint=
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? , author=. 2024 , eprint=
2024
-
[37]
Models, Usage, and Limits in Claude Code , year =
-
[38]
2026 , eprint=
Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving , author=. 2026 , eprint=
2026
-
[39]
18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , pages =
Llumnix: Dynamic Scheduling for Large Language Model Serving , author =. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , pages =. 2024 , publisher =
2024
-
[40]
2024 , eprint=
Efficient LLM Scheduling by Learning to Rank , author=. 2024 , eprint=
2024
-
[41]
2025 , eprint=
Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing , author=. 2025 , eprint=
2025
-
[42]
Nexus: Proactive Intra-
Shi, Xiaoxiang and Cai, Colin and Du, Junjia and Jia, Zhihao , journal =. Nexus: Proactive Intra-
-
[43]
DuetServe: Harmonizing Prefill and Decode for
Gao, Lei and Jiang, Chaoyi and Zarch, Hossein Entezari and Wong, Daniel and Annavaram, Murali , journal =. DuetServe: Harmonizing Prefill and Decode for
-
[44]
2025 , eprint=
Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving , author=. 2025 , eprint=
2025
-
[45]
19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25) , pages =
BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching , author =. 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25) , pages =. 2025 , publisher =
2025
-
[46]
TokenScale: Timely and Accurate Autoscaling for Disaggregated
Lai, Ruiqi and Liu, Hongrui and Lu, Chengzhi and Liu, Zonghao and Cao, Siyu and Shao, Siyang and Zhang, Yixin and Mai, Luo and Ustiugov, Dmitrii , journal =. TokenScale: Timely and Accurate Autoscaling for Disaggregated
-
[47]
2025 , howpublished =
2025
-
[48]
2024 , eprint=
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool , author=. 2024 , eprint=
2024
-
[49]
2024 , eprint=
SnapKV: LLM Knows What You are Looking for Before Generation , author=. 2024 , eprint=
2024
-
[50]
2024 , eprint=
InferCept: Efficient Intercept Support for Augmented Large Language Model Inference , author=. 2024 , eprint=
2024
-
[51]
2026 , eprint=
Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference , author=. 2026 , eprint=
2026
-
[52]
2025 , eprint=
KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows , author=. 2025 , eprint=
2025
-
[53]
2025 , eprint=
Autellix: An Efficient Serving Engine for LLM Agents as General Programs , author=. 2025 , eprint=
2025
-
[54]
2024 , eprint=
GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions , author=. 2024 , eprint=
2024
-
[55]
arXiv preprint arXiv:2602.14516 , year=
Efficient multi-round llm inference over disaggregated serving , author=. arXiv preprint arXiv:2602.14516 , year=
-
[56]
Langley , title =
P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =
2000
-
[57]
T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980
1980
-
[58]
M. J. Kearns , title =
-
[59]
Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983
1983
-
[60]
R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000
2000
-
[61]
Suppressed for Anonymity , author=
-
[62]
Newell and P
A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981
1981
-
[63]
A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959
1959
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.