AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure
2 papers cite this work.
Fields: cs.DC
Citing papers
- Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines
  Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution-time shares, then searching over fractional GPU allocations, tensor-parallelism degrees, and replica counts to achieve up to 2.4x higher throughput and 27x lower latency. (A hedged sketch of this allocation search follows the list.)
- RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
  RcLLM accelerates generative recommendation inference, improving time-to-first-token (TTFT) by 1.31x-9.51x, through beyond-prefix KV caching, replicated user caches, sharded item caches, affinity scheduling, and selective attention, with negligible accuracy loss. (A hedged sketch of this caching layout follows the list.)
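To make the Scepsy-style allocation search concrete, here is a minimal Python sketch of the kind of procedure the summary describes: enumerate a tensor-parallel degree and replica count for each LLM under a fixed GPU budget, score each combination by its bottleneck-stage rate, and keep the best one. The stage names, time shares, and the sublinear TP scaling model are invented for illustration, and the sketch discretizes to whole GPUs where Scepsy also searches fractional allocations.

```python
from itertools import product

# Hypothetical inputs: per-LLM execution-time shares as measured from traces
# (the "stable per-LLM execution time shares" in the summary). The 0.7
# tensor-parallel efficiency exponent is an assumed scaling model, not
# Scepsy's profiler.
SHARES = {"planner": 0.2, "coder": 0.5, "critic": 0.3}  # sums to 1.0
TOTAL_GPUS = 8
TP_CHOICES = (1, 2, 4)  # candidate tensor-parallel degrees

def stage_capacity(tp: int, replicas: int) -> float:
    """Requests/s one LLM stage can absorb: replicas scale linearly,
    tensor parallelism sublinearly (assumed)."""
    return replicas * tp ** 0.7

def pipeline_throughput(alloc: dict[str, tuple[int, int]]) -> float:
    """The aggregate pipeline rate is set by the most overloaded stage:
    a stage doing share s of the work must sustain rate * s."""
    return min(stage_capacity(tp, r) / SHARES[name]
               for name, (tp, r) in alloc.items())

def search() -> tuple[dict, float]:
    """Exhaustively try a (TP degree, replica count) pair per stage within
    the GPU budget and keep the allocation with the best bottleneck rate."""
    best, best_rate = None, 0.0
    choices = [(tp, r) for tp in TP_CHOICES
               for r in range(1, TOTAL_GPUS + 1) if tp * r <= TOTAL_GPUS]
    for combo in product(choices, repeat=len(SHARES)):
        if sum(tp * r for tp, r in combo) > TOTAL_GPUS:
            continue  # combination exceeds the cluster budget
        alloc = dict(zip(SHARES, combo))
        rate = pipeline_throughput(alloc)
        if rate > best_rate:
            best, best_rate = alloc, rate
    return best, best_rate

if __name__ == "__main__":
    alloc, rate = search()
    print(f"best allocation: {alloc}  bottleneck rate: {rate:.2f}")
```

The brute-force loop is only viable because the choice set per stage is tiny; a real scheduler searching fractional allocations would need a continuous optimizer or a pruned search in place of `product`.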
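Likewise, a toy Python sketch of the caching layout named in the RcLLM summary, under loose assumptions: KV entries are keyed by segment content hash rather than by prompt prefix (so non-prefix segments can be reused), item entries are sharded by hash, and hash-based affinity routing keeps a user's history cache warm on one worker. Every name here (KVStore, seg_key, serve) is hypothetical, and the byte strings stand in for real KV tensors.

```python
import hashlib
from dataclasses import dataclass, field

NUM_SHARDS = 4  # assumed worker/shard count, for illustration only

def seg_key(text: str) -> str:
    """Content hash of a prompt segment; position-independent, which is
    what lets reuse go beyond exact prefix matches."""
    return hashlib.sha256(text.encode()).hexdigest()

@dataclass
class KVStore:
    entries: dict[str, bytes] = field(default_factory=dict)

    def get_or_compute(self, text: str) -> bytes:
        key = seg_key(text)
        if key not in self.entries:  # cache miss: would run the model here
            self.entries[key] = f"KV({text})".encode()  # stand-in for KV tensors
        return self.entries[key]

user_caches = [KVStore() for _ in range(NUM_SHARDS)]  # one user cache per worker
item_shards = [KVStore() for _ in range(NUM_SHARDS)]  # item cache sharded by hash

def route_user(user_id: str) -> int:
    """Affinity scheduling (assumed hash-based): pin a user's requests to
    one worker so that worker's copy of the user cache stays warm."""
    return int(seg_key(user_id), 16) % NUM_SHARDS

def serve(user_id: str, history: str, items: list[str]) -> list[bytes]:
    worker = route_user(user_id)
    kv = [user_caches[worker].get_or_compute(history)]  # user-history segment
    for item in items:
        shard = int(seg_key(item), 16) % NUM_SHARDS  # item KV lives on one shard
        kv.append(item_shards[shard].get_or_compute(item))
    return kv

if __name__ == "__main__":
    serve("u42", "clicked: shoes, socks", ["red sneaker", "wool sock"])
    print("warm segments:", serve("u42", "clicked: shoes, socks", ["red sneaker"]))
```

The second `serve` call hits the cache for both the user history and the repeated item segment, which is the effect the TTFT speedups in the summary come from; the selective-attention component is not modeled here.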