Serving models, fast and slow: optimizing heterogeneous llm inferencing workloads at scale

Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, et al · 2025 · arXiv 2502.14617

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

cs.DC · 2026-04-17 · unverdicted · novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG · 2026-03-22 · unverdicted · novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

RAP: Runtime Adaptive Pruning for LLM Inference

cs.LG · 2025-05-22 · unverdicted · novelty 5.0

RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.

Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs

cs.DC · 2026-06-29 · unverdicted · novelty 4.0

Festina reduces energy consumption by up to 56% for serverless LLM inference on shared GPUs while keeping TTFT/TBT SLO attainment within 2% of four state-of-the-art baselines.

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

cs.DC · 2026-05-16 · unverdicted · novelty 4.0

GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.

citing papers explorer

Showing 6 of 6 citing papers.

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving cs.DC · 2026-04-17 · unverdicted · none · ref 24
KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving cs.CL · 2026-04-09 · unverdicted · none · ref 8
Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project cs.LG · 2026-03-22 · unverdicted · none · ref 52
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
RAP: Runtime Adaptive Pruning for LLM Inference cs.LG · 2025-05-22 · unverdicted · none · ref 17
RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.
Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs cs.DC · 2026-06-29 · unverdicted · none · ref 16
Festina reduces energy consumption by up to 56% for serverless LLM inference on shared GPUs while keeping TTFT/TBT SLO attainment within 2% of four state-of-the-art baselines.
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources cs.DC · 2026-05-16 · unverdicted · none · ref 11
GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.

Serving models, fast and slow: optimizing heterogeneous llm inferencing workloads at scale

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer