Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services
In Large Language Model (LLM) inference services, it is challenging to configure a parallelism strategy that efficiently processes requests with varying context lengths. Requests with long contexts require a high degree of parallelism to provide more memory for the Key-Value (KV) cache, while requests with short contexts prefer a low degree of parallelism to increase concurrency and thus improve throughput. To maintain high throughput while supporting large context lengths on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services that adaptively adjusts the TP degree of running instances to align with the dynamics of incoming requests. Evaluations using real-world traces show that Amoeba improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.
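To make the memory/concurrency tradeoff concrete, here is a minimal Python sketch. It is not Amoeba's actual algorithm; the model dimensions, TP options, and per-GPU KV-cache budget are hypothetical. It picks the smallest TP degree whose sharded KV-cache capacity fits a request's context length, so short requests land on low-TP instances (leaving GPUs free for more concurrent instances) while long requests get the aggregate memory of a larger TP group.

```python
# Illustrative sketch of the tradeoff described in the abstract, not
# Amoeba's algorithm. All sizes below are hypothetical assumptions.

from dataclasses import dataclass


@dataclass
class ModelConfig:
    num_layers: int = 32    # hypothetical model dimensions
    num_kv_heads: int = 8
    head_dim: int = 128
    dtype_bytes: int = 2    # fp16/bf16 KV cache


def kv_bytes_per_token(cfg: ModelConfig) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.dtype_bytes


def min_tp_degree(context_len: int, cfg: ModelConfig,
                  per_gpu_kv_budget: int, tp_options=(1, 2, 4, 8)) -> int:
    """Smallest TP degree whose combined KV-cache budget covers the request.

    Tensor parallelism shards the KV cache across the TP group, so a
    degree-d instance holds roughly d * per_gpu_kv_budget of KV cache,
    while a cluster of N GPUs can run N / d such instances concurrently.
    """
    need = context_len * kv_bytes_per_token(cfg)
    for d in tp_options:
        if d * per_gpu_kv_budget >= need:
            return d
    raise ValueError("context too long for the largest TP group")


cfg = ModelConfig()
budget = 8 * 1024**3  # assume 8 GiB of KV-cache headroom per GPU
for ctx in (2_000, 32_000, 128_000):
    print(f"{ctx} tokens -> TP degree {min_tp_degree(ctx, cfg, budget)}")
```

With these assumed numbers, 2K- and 32K-token requests fit on a TP-1 instance, while a 128K-token request needs TP-2; a static configuration would have to provision every instance for the worst case, which is the inefficiency Amoeba's runtime transformation targets.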
Forward citations
Cited by 1 Pith paper
Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.