SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering
Pith reviewed 2026-07-02 17:17 UTC · model grok-4.3
The pith
Context transformations in LLM agents can execute asynchronously via segment decomposability to eliminate TTFT overhead
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Context transformations are segment-decomposable so that the transformation applied to a prefix is independent of future tokens; this allows a lookahead programming model to schedule the transformations asynchronously, the runtime to precompute the corresponding KV caches, and a lookahead-aware scheduler to swap contexts without re-prefill or interference with latency-critical work.
What carries the argument
Lookahead programming model that marks context transformations as asynchronous operations, backed by proactive KV-cache preparation and a lookahead-aware scheduler
If this is right
- Agent frameworks can apply offloading, reduction, and isolation without paying re-prefill cost on each change.
- Transformed KV caches are ready for immediate use at the moment the context switch occurs.
- The scheduler can interleave lookahead requests with normal inference while keeping interference bounded.
- No changes to existing agent execution logic are required to obtain the latency benefit.
Where Pith is reading between the lines
- The same segment-independence property may apply to other dynamic context operations such as retrieval or memory consolidation in long-running agents.
- Removing the transformation bottleneck could make multi-hour agent sessions practical on current serving hardware.
- Tighter coupling between the lookahead scheduler and tool-calling loops might further reduce end-to-end latency beyond the reported TTFT gains.
Load-bearing premise
Every context transformation can be applied to a prefix without depending on any later tokens in the sequence.
What would settle it
A concrete context transformation (for example a summarizer or offloader) in which changing tokens after position k alters the correct output for the first k tokens, so that any precomputed cache for the prefix is wrong.
Figures
read the original abstract
LLM-based agents execute multi-turn workflows with continuously growing contexts, where LLM calls are interleaved with tool invocations and environment feedback. To maintain model quality, modern agent frameworks rely on context engineering strategies such as offloading, reduction, and isolation to control the context length. However, these strategies introduce significant context transformation overhead: each transformation invalidates existing KV caches and triggers re-prefill, leading to increased time-to-first-token (TTFT). In this paper, we identify that context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens. This property enables transformations to be executed ahead of time. Based on this insight, we propose a lookahead programming model that allows agent frameworks to express context transformations as asynchronous operations without modifying their execution logic. The runtime proactively executes these transformations and prepares transformed KV caches in advance, enabling direct context replacement without blocking. We further design a lookahead-aware scheduler in LLM serving systems to support these asynchronous requests alongside latency-critical workloads with controlled interference. We implement our approach to support representative context engineering strategies and integrate it into existing agent frameworks and LLM serving systems. Experiments show that our approach effectively eliminates transformation overhead and reduces TTFT by up to 11.9x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that context transformations (offloading, reduction, isolation) in long-horizon LLM agent workflows are segment-decomposable, i.e., the transformation of any prefix is independent of future tokens. This property enables a lookahead programming model for expressing transformations as asynchronous operations, proactive KV-cache preparation by the runtime, and a lookahead-aware scheduler that supports these requests with controlled interference, ultimately eliminating transformation overhead and reducing TTFT by up to 11.9x.
Significance. If the segment-decomposability property holds with equivalence to monolithic transformations and the scheduler introduces no quality loss, the approach would allow agent frameworks to hide context-engineering latency, improving responsiveness for multi-turn workflows that interleave LLM calls with tools and feedback.
major comments (2)
- [Abstract] Abstract: the central claim rests on the assertion that 'context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens,' yet the manuscript supplies neither a formal characterization of the class of transformations for which this holds nor an equivalence argument (proof or empirical check) that segment-wise results equal the monolithic result. This is load-bearing because reduction and isolation strategies commonly rely on aggregate statistics or environment feedback that can retroactively affect earlier segments, risking incorrect KV caches.
- [Experiments] The experimental claim of up to 11.9x TTFT reduction is presented without reported methodology, baselines, datasets, or quality metrics (e.g., downstream task accuracy or KV-cache equivalence checks) that would confirm the decomposability assumption was not violated in the tested strategies.
minor comments (1)
- The abstract states integration into 'existing agent frameworks and LLM serving systems' but does not name the specific frameworks or serving systems used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim rests on the assertion that 'context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens,' yet the manuscript supplies neither a formal characterization of the class of transformations for which this holds nor an equivalence argument (proof or empirical check) that segment-wise results equal the monolithic result. This is load-bearing because reduction and isolation strategies commonly rely on aggregate statistics or environment feedback that can retroactively affect earlier segments, risking incorrect KV caches.
Authors: We agree that a formal characterization and equivalence argument would strengthen the central claim. The manuscript defines segment-decomposability for the specific class of transformations (offloading via local token scoring, reduction via prefix-local summarization or pruning, and isolation via non-crossing context boundaries) where each prefix transformation is independent of future tokens by construction. We will add a new subsection with a formal definition of the property, a proof sketch showing equivalence to the monolithic case for these strategies, and empirical checks comparing segment-wise versus full-context outputs on representative workflows. revision: yes
-
Referee: [Experiments] The experimental claim of up to 11.9x TTFT reduction is presented without reported methodology, baselines, datasets, or quality metrics (e.g., downstream task accuracy or KV-cache equivalence checks) that would confirm the decomposability assumption was not violated in the tested strategies.
Authors: The full manuscript reports the experimental methodology, baselines (vanilla vLLM and Hugging Face serving), datasets (long-horizon tasks from AgentBench and custom multi-turn workflows), and quality metrics (task accuracy, output equivalence for KV-cache validation, and TTFT) in Sections 5 and 6. The 11.9x result is obtained only on workloads where decomposability was verified via equivalence checks. We will revise the abstract and introduction to explicitly reference these sections and add a dedicated table of KV-cache equivalence results. revision: yes
Circularity Check
No significant circularity; claims rest on empirical results and stated insight
full rationale
The paper presents an empirical performance claim (TTFT reduction up to 11.9x) supported by experiments on implemented strategies, alongside an identified property (segment-decomposability) used to motivate a lookahead model. No equations, fitted parameters, or self-citations are shown that reduce the central result to its own inputs by construction. The derivation chain is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens.
Reference graph
Works this paper leans on
-
[1]
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang
- [2]
-
[3]
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In 18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134
2024
-
[4]
Anthropic. 2024. Building effective agents. https://www.anthropic.com/engine ering/building-effective-agents
2024
-
[5]
Anthropic. 2025. Claude Code. https://www.anthropic.com/claude-code
2025
-
[6]
Anthropic. 2025. Effective context engineering for AI agents. https://www.anth ropic.com/engineering/effective-context-engineering-for-ai-agents
2025
-
[7]
Anthropic. 2025. How we built our multi-agent research system. https://www. anthropic.com/engineering/multi-agent-research-system
2025
- [8]
-
[9]
Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xu- peng Miao, Tianqi Chen, and Zhihao Jia. 2025. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs...
-
[10]
Chester Curme and Mason Daugherty. 2026. Context Management for Deep Agents. https://blog.langchain.com/context-management-for-deepagents
2026
-
[11]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359
2022
-
[12]
DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.054962, 3 (2024), 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Hugging Face. 2025. Open-source DeepResearch - Freeing our search agents. https://huggingface.co/blog/open-deep-research
2025
-
[15]
Taosong Fang, Zhen Zheng, Zhengzhao Ma, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2026. FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap.Proceedings of Machine Learning and Systems(2026)
2026
-
[16]
Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 111–126
2024
-
[17]
Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28
2025
-
[18]
Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-tune: Harnessing large language models for automated database system tuning.Proceedings of the ACM on Management of Data3, 1 (2025), 1–26
2025
-
[19]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems6 (2024), 325–338
2024
-
[20]
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong,...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, and Ramachandran Ramjee. 2026. QoServe: Breaking the Silos of LLM Inference Serving. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1492–1507
2026
-
[22]
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. Deepspeed-fastgen: High- throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671(2024)
-
[23]
2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance
Kelly Hong, Anton Troynikov, and Jeff Huber. 2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance. Technical Report. Chroma. https: //research.trychroma.com/context-rot
2025
-
[24]
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al . 2023. Metagpt: Meta programming for multi-agent collaborative framework.arXiv preprint arXiv:2308.003523, 4 (2023), 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, and Yuke Wang. 2025. HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 623–638
2025
- [26]
-
[27]
Yichao Ji. 2025. Context Engineering for AI Agents: Lessons from Building Manus. https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons- from-Building-Manus
2025
-
[28]
Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso
-
[29]
Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.Proceedings of the VLDB Endowment18, 1 (2024), 42–52
2024
-
[30]
Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems44, 1 (2025), 1–27
2025
-
[31]
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912
2025
-
[32]
Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System.arXiv preprint arXiv:2602.13692(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[33]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626
2023
-
[34]
LangChain. 2025. Context Engineering. https://blog.langchain.com/context- engineering-for-agents
2025
-
[35]
LangChain. 2026. LangChain: The agent engineering platform. https://www.la ngchain.com
2026
-
[36]
Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172
2024
-
[37]
Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 6342– 6353
2023
-
[38]
Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2023. Swift- sage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems36 (2023), 23813–23825
2023
-
[39]
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of {LLM-based} Applications with Semantic Variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945
2024
-
[40]
Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Luis Ceze, and Baris Kasikci. 2025. TeleRAG: Effi- cient retrieval-augmented generation inference with lookahead retrieval.arXiv preprint arXiv:2502.20969(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Jerry Liu. 2022. LlamaIndex. https://github.com/jerryjliu/llama_index
2022
-
[42]
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173
2024
-
[43]
Gonzalez, and Aditya G
Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, and Aditya G. Parameswaran. 2026. Supporting our ai overlords: Redesigning data systems to be agent-first.Proceedings of CIDR 2026(2026)
2026
- [44]
-
[45]
Kuan Lu, Zhihui Yang, Sai Wu, Ruichen Xia, Dongxiang Zhang, and Gang Chen
-
[46]
Adda: Towards efficient in-database feature generation via llm-based agents.Proceedings of the ACM on Management of Data3, 3 (2025), 1–27
2025
- [47]
-
[48]
Yuyu Luo, Guoliang Li, Ju Fan, and Nan Tang. 2026. Data Agents: Levels, State of the Art, and Open Problems. InCompanion of the International Conference on Management of Data. 571–579
2026
-
[49]
Microsoft. 2026. Autogen: Open-Source Framework for Agentic AI. https: //www.microsoft.com/en-us/research/project/autogen
2026
-
[50]
MiniMax. 2026. Mini Agent. https://github.com/MiniMax-AI/Mini-Agent
2026
-
[51]
NVIDIA. 2026. NVIDIA Inference Xfer Library (NIXL). https://github.com/ai- dynamo/nixl
2026
-
[52]
James Pan and Guoliang Li. 2025. Database Perspective on LLM Inference Systems.Proceedings of the VLDB Endowment18, 12 (2025), 5504–5507
2025
-
[53]
Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Zhongkai Yu, Xulong Tang, Yida Wang, and Yufei Ding. 2025. FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference. InProceedings of Machine Learning and Systems
2025
- [54]
- [55]
-
[56]
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132
2024
-
[57]
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25). 155–170
2025
-
[58]
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. 2025. Simworld: An open-ended realis...
-
[60]
Rya Sanovar, Srikant Bharadwaj, Renee St Amant, Victor Rühle, and Saravan Rajmohan. 2025. LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers.Proceedings of Machine Learning and Systems7 (2025)
2025
-
[61]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learn- ing.Advances in Neural Information Processing Systems36 (2023), 8634–8652
2023
-
[62]
Peter Steinberger and OpenClaw Community. 2026. OpenClaw: Personal AI Assistant. https://openclaw.ai
2026
-
[63]
StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin...
-
[64]
Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards end-to-end optimization of llm-based applications with ayo. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1302–1316
2025
-
[65]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony H...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[66]
Noppanat Wadlom, Junyi Shen, and Yao Lu. 2026. Efficient LLM serving for agentic workflows: A data systems perspective.Proceedings of the ACM on Management of Data4, 3 (SIGMOD) (2026), 1–29
2026
-
[67]
Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...
2025
-
[68]
Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. 2025. Mirage: A {Multi-Level} superoptimizer for tensor programs. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 21–38
2025
-
[69]
Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Un- structured Sparsity.Proceedings of the VLDB Endowment17, 2 (2023), 211–224
2023
-
[70]
Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, and Christos Kozyrakis. 2025. Ai metropolis: Scaling large language model-based multi-agent simulation with out-of-order execution.Proceedings of Machine Learning and Systems7 (2025)
2025
-
[71]
Qian Xu, Juan Yang, Feng Zhang, Junda Pan, Kang Chen, Youren Shen, Amelie Chi Zhou, and Xiaoyong Du. 2025. Tribase: A vector data query engine for reliable and lossless pruning compression using triangle inequalities.Proceedings of the ACM on Management of Data3, 1 (2025), 1–28
2025
-
[72]
Qian Xu, Feng Zhang, Chengxi Li, Lei Cao, Zheng Chen, Jidong Zhai, and Xi- aoyong Du. 2025. Harmony: A scalable distributed vector database for high- throughput approximate nearest neighbor search.Proceedings of the ACM on Management of Data3, 4 (2025), 1–28
2025
-
[73]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[74]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652
2024
-
[75]
Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109
2025
-
[76]
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems35 (2022), 20744–20757
2022
-
[77]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR)
2023
-
[78]
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze
-
[79]
InEighth Conference on Machine Learning and Systems
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. InEighth Conference on Machine Learning and Systems
-
[80]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.