Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

2024 · arXiv 2407.00079

Canonical reference. 80% of citing Pith papers cite this work as background.

27 Pith papers citing it

citation-role summary

background: 3 · method: 2

citation-polarity summary

background: 4 · use method: 1

representative citing papers

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

cs.OS · 2026-05-05 · unverdicted · novelty 7.0

Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

cs.DC · 2025-05-29 · conditional · novelty 7.0

GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

cs.DC · 2026-07-02 · unverdicted · novelty 6.0

OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

cs.DC · 2026-06-10 · unverdicted · novelty 6.0

Effective LLM inference cost per million output tokens varies 2.5-36x with offered request rate due to utilization, addressed by a concurrency-aware measurement methodology and open-source vLLM tool validated across model types.

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

cs.LG · 2026-06-07 · unverdicted · novelty 6.0

SpectrumKV applies per-token mixed-precision KV cache transfer (FP16/INT8/INT4) with a model-specific probe for INT4 tolerance, achieving better perplexity and retrieval than PDTrim at equivalent budgets on Qwen2.5-7B, Mistral-7B, and Gemma-2-9B.

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

Vortex provides a programmable frontend and backend for sparse attention in LLM serving, delivering up to 3.46x throughput over full attention while preserving accuracy.

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

cs.OS · 2026-05-30 · unverdicted · novelty 6.0

MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.

PreFT: Prefill-only finetuning for efficient inference

cs.LG · 2026-05-14 · accept · novelty 6.0

Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

cs.LG · 2025-11-01 · unverdicted · novelty 6.0

Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.

TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

cs.DC · 2025-10-21 · unverdicted · novelty 6.0

TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM on representative benchmarks.

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

cs.AR · 2025-05-19 · unverdicted · novelty 6.0

Sandwich delivers 2.01x average end-to-end speedup and up to 3.4x latency reduction for CPU LLM serving via phase-wise hot-switching, TopoTree hardware abstraction, and fast-start dynamic kernel generation.

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

cs.CL · 2024-11-29 · unverdicted · novelty 6.0

BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

cs.LG · 2024-09-16 · conditional · novelty 6.0

RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.

Omni-Flow: A Unified Workflow Orchestration and Distributed KV Cache Sharing Framework for Multimodal Inference

cs.DC · 2026-06-30 · unverdicted · novelty 5.0

Omni-Flow introduces a three-layer abstraction (Control Flow, Data Flow, Compute Flow) for unified orchestration and KV cache sharing in multimodal inference pipelines.

Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving

cs.DC · 2026-06-29 · unverdicted · novelty 5.0 · 2 refs

Organizes the heterogeneous LLM prefill-decode design space along four axes and extracts three boundary decisions with guidance on precision, KV representation, and ownership.

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

cs.LG · 2026-06-05 · unverdicted · novelty 5.0

SCD replaces raw KV cache transmission with compact semantic codes via reuse and patching to achieve up to 2.65x TTFT speedup while staying within 5% F1 of oracle quality.

Human-Less LLM Serving: Quantifying the Human Tax on Throughput

cs.NI · 2026-05-03 · unverdicted · novelty 5.0

A measurement study finds that LLM serving systems sacrifice 60-93% of throughput to meet human-centric TTFT/TPOT SLOs that are unnecessary for programmatic long-horizon tasks.

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

cs.AR · 2026-04-19 · unverdicted · novelty 5.0

A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

cs.CL · 2026-04-03 · unverdicted · novelty 5.0

JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG · 2026-03-22 · unverdicted · novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

cs.DC · 2025-08-21 · unverdicted · novelty 5.0

HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · none · ref 39

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

cs.AR · 2026-04-17 · unverdicted · none · ref 45

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG · 2026-03-22 · unverdicted · none · ref 64

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.