pith. sign in

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

In Large Language Model (LLM) inference services, it is challenging to make a parallelism strategy configuration, to efficiently process the requests of variance context lengths. Requests of long context require high degree of parallelism to provide more memory for Key-Value (KV) Cache, while requests of short context prefer low degree of parallelism to increase concurrency, thus improving throughput. To maintain high throughput while supporting large context lengths on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services, which adaptively adjusts the TP of running instances to align with the dynamics of incoming requests. Evaluations using real-world traces show that Amoeba improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.

fields

cs.DC 1 cs.LG 1

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Libra: Efficient Resource Management for Agentic RL Post-Training

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.

citing papers explorer

Showing 2 of 2 citing papers.