Title resolution pending

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, Christopher Ré

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

browse 5 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

method 2

citation-polarity summary

use method 2

representative citing papers

Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

cs.PL · 2026-04-16 · unverdicted · novelty 7.0

Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.

Efficient Memory Management for Large Language Model Serving with PagedAttention

cs.LG · 2023-09-12 · conditional · novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

cs.DC · 2026-05-16 · unverdicted · novelty 6.0

ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

cs.DC · 2026-04-03 · unverdicted · novelty 6.0

TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

cs.DC · 2025-08-21 · unverdicted · novelty 5.0

TurboMind delivers up to 61% lower latency and 156% higher throughput for mixed-precision LLM inference across 16 models and 4 GPU architectures via optimized weight packing, adaptive alignment, instruction parallelism, and KV memory pipelines.

citing papers explorer

Showing 5 of 5 citing papers.

Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels cs.PL · 2026-04-16 · unverdicted · none · ref 10
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
Efficient Memory Management for Large Language Model Serving with PagedAttention cs.LG · 2023-09-12 · conditional · none · ref 13
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse cs.DC · 2026-05-16 · unverdicted · none · ref 13
ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing cs.DC · 2026-04-03 · unverdicted · none · ref 3
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind cs.DC · 2025-08-21 · unverdicted · none · ref 12
TurboMind delivers up to 61% lower latency and 156% higher throughput for mixed-precision LLM inference across 16 models and 4 GPU architectures via optimized weight packing, adaptive alignment, instruction parallelism, and KV memory pipelines.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer