pith. machine review for the scientific record.

arXiv: 2412.03594 · v3 · submitted 2024-11-29 · 💻 cs.CL · cs.AI · cs.DC · cs.LG

Recognition: unknown

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng

Authors on Pith: no claims yet
classification 💻 cs.CL · cs.AI · cs.DC · cs.LG
keywords batchllm · prefix · requests · large · sharing · tasks · batched · common
Abstract

Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their key performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and show limitations in supporting large batched tasks with the prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests, but KV context that is about to be reused may be prematurely evicted under this implicit cache management. Moreover, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address these problems. BatchLLM explicitly identifies common prefixes globally, and requests sharing the same prefix are scheduled together to best reuse the KV context. BatchLLM reorders the requests, scheduling requests with a larger decoding ratio first to better mix decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.
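The abstract names three scheduling ideas: group requests by an explicitly identified shared prefix so its KV context is computed once, schedule groups with a larger decoding ratio first so decode tokens can fill later prefill chunks, and grow token batches toward a memory budget. The sketch below illustrates those ideas in plain Python under simplifying assumptions; `Request`, `group_by_prefix`, `decode_ratio`, and `token_budget` are illustrative names invented here, not BatchLLM's actual API (which lives in the linked vLLM fork).

```python
# Minimal sketch of prefix-grouped, decode-ratio-ordered, budget-bounded batching.
# All heuristics here are simplified assumptions for illustration only.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_tokens: list[int]   # prefill tokens
    max_new_tokens: int        # expected decode length


def group_by_prefix(requests: list[Request], prefix_len: int) -> dict[tuple, list[Request]]:
    """Group requests whose first `prefix_len` tokens match, so the shared
    prefix KV context is computed once and reused by the whole group."""
    groups: dict[tuple, list[Request]] = defaultdict(list)
    for r in requests:
        groups[tuple(r.prompt_tokens[:prefix_len])].append(r)
    return groups


def decode_ratio(group: list[Request]) -> float:
    """Fraction of a group's total token work that is decoding rather than prefill."""
    prefill = sum(len(r.prompt_tokens) for r in group)
    decode = sum(r.max_new_tokens for r in group)
    return decode / max(prefill + decode, 1)


def schedule(requests: list[Request], prefix_len: int, token_budget: int) -> list[list[Request]]:
    """Order prefix groups by decode ratio (largest first), then pack requests
    into token batches bounded by `token_budget`, a stand-in for KV-memory limits.
    Requests from the same group stay adjacent so the shared-prefix KV can be reused."""
    groups = sorted(group_by_prefix(requests, prefix_len).values(),
                    key=decode_ratio, reverse=True)
    batches: list[list[Request]] = []
    current: list[Request] = []
    used = 0
    for group in groups:
        for r in group:
            cost = len(r.prompt_tokens) + r.max_new_tokens
            if current and used + cost > token_budget:
                batches.append(current)
                current, used = [], 0
            current.append(r)
            used += cost
    if current:
        batches.append(current)
    return batches
```

As a usage example, `schedule(reqs, prefix_len=128, token_budget=8192)` would return batches in which shared-prefix requests are co-located and decode-heavy groups run first; the real system additionally interleaves decode tokens into prefill chunks at the engine level, which this sketch does not model.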

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.

  2. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  3. PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

    cs.DC · 2026-05 · unverdicted · novelty 5.0

    PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

  4. Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

    cs.LG · 2026-03 · unverdicted · novelty 2.0

    The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.