AIBrix: Towards scalable, cost-effective large language model inference infrastructure

The AIBrix Team: Jiaxin Shan, Varun Gupta, Le Xu, Haiyang Shi, Jingyuan Zhang, Ning Wang, Linhui Xu, Rong Kang, Tongping Liu, Yifei Zhang, Yiqing Zhu, Shuowei Jin, Gangmuk Lim, Binbin Chen, Zuzhi Chen, Xiao Liu, Xin Chen, Kante Yin, Chak-Po · 2025 · arXiv 2504.03648

5 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution time shares, then searching fractional GPU allocations, tensor parallelism, and replica counts to achieve up to 2.4x higher throughput and 27x lower latency.

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

cs.DC · 2025-12-22 · conditional · novelty 6.0

CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.

RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching

cs.DC · 2026-05-08 · unverdicted · novelty 5.0

RcLLM accelerates generative recommendation inference by 1.31x-9.51x in TTFT through beyond-prefix KV caching, replicated user caches, sharded item caches, affinity scheduling, and selective attention with negligible accuracy loss.

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

cs.LG · 2025-09-25 · unverdicted · novelty 5.0

Learning-augmented LRU achieves 1-consistency and O(k)-robustness for GPU caching with low overhead, implemented in LCR to cut P99 TTFT by up to 28.3% on LLM workloads and raise throughput by up to 24.2% on DLRM workloads.

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

cs.DC · 2026-05-16 · unverdicted · novelty 4.0

GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.

citing papers explorer

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

cs.LG · 2025-09-25 · unverdicted · none · ref 37
